Abstract

A version of the dueling bandit problem is addressed in which a Condorcet winner may not exist. Two algorithms 
are proposed that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet 
winner, is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed for small numbers of 
arms, while the second, Scalable Copeland Bandits (SCB), works better for large-scale problems. We provide
theoretical results bounding the regret accumulated by CCB and SCB, both substantially improving existing results.
Such existing results either offer bounds of the form $O(K \log T)$ but require restrictive assumptions, or offer bounds
of the form $O(K^2 \log T)$ without requiring such assumptions. Our results offer the best of both worlds: $O(K \log T)$
bounds without restrictive assumptions. 


1 Introduction 

The dueling bandit problem arises naturally in domains where feedback is more reliable when given as a pairwise
preference (e.g., when it is provided by a human) and specifying real-valued feedback instead would be arbitrary
or inefficient. Examples include ranker evaluation in information retrieval, ad placement and recommender
systems. As with other preference learning problems, feedback consists of a pairwise preference between a selected
pair of arms, instead of a scalar reward for a single selected arm, as in the $K$-armed bandit problem.

Most existing algorithms for the dueling bandit problem require the existence of a Condorcet winner, which is an 
arm that beats every other arm with probability greater than 0.5. If such algorithms are applied when no Condorcet 
winner exists, no decision may be reached even after many comparisons. This is a key weakness limiting their practical 
applicability. For example, in industrial ranker evaluation [6], when many rankers must be compared, each comparison
corresponds to a costly live experiment, and thus the potential for failure if no Condorcet winner exists is unacceptable.

This risk is not merely theoretical. On the contrary, recent experiments on $K$-armed dueling bandit problems
based on information retrieval datasets show that dueling bandit problems without Condorcet winners arise regularly
in practice [8, Figure 1]. In addition, we show in Appendix C.1 that there are realistic situations in ranker evaluation in
information retrieval in which the probability that the Condorcet assumption holds decreases rapidly as the number of
arms grows. Since the $K$-armed dueling bandit methods mentioned above do not provide regret bounds in the absence
of a Condorcet winner, applying them remains risky in practice. Indeed, we demonstrate empirically the danger of
applying such algorithms to dueling bandit problems that do not have a Condorcet winner (cf. Appendix A).

The non-existence of the Condorcet winner has been investigated extensively in social choice theory, where numerous
definitions have been proposed, without a clear contender for the most suitable resolution. In the dueling
bandit context, a few methods have been proposed to address this issue, e.g., SAVAGE [10], PBR [11] and RankEl
[12], which use some of the notions proposed by social choice theorists, such as the Copeland score or the Borda score,
to measure the quality of each arm, hence determining what constitutes the best arm (or more generally the top-$k$
arms). In this paper, we focus on finding Copeland winners, which are arms that beat the greatest number of other
arms, because this is a natural, conceptually simple extension of the Condorcet winner.

Unfortunately, the methods mentioned above come with bounds of the form $O(K^2 \log T)$. In this paper, we
propose two new $K$-armed dueling bandit algorithms for the Copeland setting with significantly improved bounds.

The first algorithm, called Copeland Confidence Bound (CCB), is inspired by the recently proposed Relative
Upper Confidence Bound method [13], but modified and extended to address the unique challenges that arise when
no Condorcet winner exists. We prove anytime high-probability and expected regret bounds for CCB of the form
$O(K^2 + K \log T)$. Furthermore, the denominator of this result has much better dependence on the “gaps” arising
from the dueling bandit problem than most existing results (cf. Sections 3 and 5.1 for the details).

However, a remaining weakness of CCB is the additive $O(K^2)$ term in its regret bounds. In applications with large
$K$, this term can dominate for any experiment of reasonable duration. For example, at Bing, 200 experiments are run
concurrently on any given day, in which case the duration of the experiment needs to be longer than the age of
the universe in nanoseconds before $K \log T$ becomes significant in comparison to $K^2$.
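To make the scale of this comparison concrete, here is a back-of-the-envelope calculation. It assumes natural logarithms, hidden constants equal to one, and an age of the universe of roughly $13.8 \times 10^9$ years; these figures are assumptions of the illustration, not part of the analysis.

```latex
\begin{align*}
K \log T \ge K^2 &\iff \log T \ge K = 200 \iff T \ge e^{200} \approx 7.2 \times 10^{86}, \\
\text{age of the universe} &\approx 13.8 \times 10^{9}\ \text{years}
  \approx 4.4 \times 10^{17}\ \text{s} = 4.4 \times 10^{26}\ \text{ns} \ll 7.2 \times 10^{86}.
\end{align*}
```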

Our second algorithm, called Scalable Copeland Bandits (SCB), addresses this weakness by eliminating the
$O(K^2)$ term, achieving an expected regret bound of the form $O(K \log K \log T)$. The price of SCB’s tighter regret
bounds is that, when two suboptimal arms are close to evenly matched, it may waste comparisons trying to determine 
which one wins in expectation. By contrast, CCB can identify that this determination is unnecessary, yielding better 
performance unless there are very many arms. CCB and SCB are thus complementary algorithms for finding Copeland 
winners. 

Our main contributions are as follows:

1. We propose two new algorithms that address the dueling bandit problem in the absence of a Condorcet winner, one 
designed for problems with small numbers of arms and the other scaling well with the number of arms. 

2. We provide regret bounds that bridge the gap between two groups of results: those of the form $O(K \log T)$ that
make the Condorcet assumption, and those of the form $O(K^2 \log T)$ that do not make the Condorcet assumption.
Our bounds are similar to those of the former but are as broadly applicable as the latter. Furthermore, the result for 
CCB has substantially better dependence on the gaps than the second group of results. 

In addition, Appendix A presents the results of an empirical evaluation of CCB and SCB using a real-life problem
arising from information retrieval (IR). The experimental results mirror the theoretical ones. 

2 Problem Setting 

Let $K \geq 2$. The $K$-armed dueling bandit problem is a modification of the $K$-armed bandit problem. The
latter considers $K$ arms $\{a_1, \ldots, a_K\}$ and at each time-step, an arm $a_i$ can be pulled, generating a reward drawn from
an unknown stationary distribution with expected value $\mu_i$. The $K$-armed dueling bandit problem is a variation in
which, instead of pulling a single arm, we choose a pair $(a_i, a_j)$ and receive one of them as the better choice, with the
probability of $a_i$ being picked equal to an unknown constant $p_{ij}$ and that of $a_j$ being picked equal to $p_{ji} = 1 - p_{ij}$. A
problem instance is fully specified by a preference matrix $P = [p_{ij}]$, whose $ij$ entry is equal to $p_{ij}$.

Most previous work assumes the existence of a Condorcet winner: an arm, which without loss of generality
we label $a_1$, such that $p_{1i} > \frac{1}{2}$ for all $i > 1$. In such work, regret is defined relative to the Condorcet winner. However,
Condorcet winners do not always exist. In this paper, we consider a formulation of the problem that does not
assume the existence of a Condorcet winner.

Instead, we consider the Copeland dueling bandit problem, which defines regret with respect to a Copeland winner,
which is an arm with maximal Copeland score. The Copeland score of $a_i$, denoted $\mathrm{Cpld}(a_i)$, is the number of arms $a_j$
for which $p_{ij} > 0.5$. The normalized Copeland score, denoted $\mathrm{cpld}(a_i)$, is simply $\mathrm{Cpld}(a_i)/(K - 1)$. Without loss of generality,
we assume that $a_1, \ldots, a_C$ are the Copeland winners, where $C$ is the number of Copeland winners. We define regret
as follows:

Definition 1. The regret incurred by comparing $a_i$ and $a_j$ is $2\,\mathrm{cpld}(a_1) - \mathrm{cpld}(a_i) - \mathrm{cpld}(a_j)$.

Remark 2. Since our results (see Section 5) establish bounds on the number of queries to non-Copeland winners, they can
also be applied to other notions of regret. 
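The definitions above can be illustrated with a minimal Python sketch. The function names and the 4-armed preference matrix are hypothetical, chosen so that the first three arms form a cycle and no Condorcet winner exists; arms are indexed from 0 in the code.

```python
def copeland_scores(P):
    """Return (raw, normalized) Copeland scores for a preference matrix P."""
    K = len(P)
    # Raw score of arm i: number of other arms j with p_ij > 0.5.
    raw = [sum(1 for j in range(K) if j != i and P[i][j] > 0.5) for i in range(K)]
    normalized = [s / (K - 1) for s in raw]
    return raw, normalized


def duel_regret(P, i, j):
    """Regret of comparing arms i and j, relative to a Copeland winner."""
    _, cpld = copeland_scores(P)
    return 2 * max(cpld) - cpld[i] - cpld[j]


# Hypothetical instance: arms 0, 1, 2 beat each other cyclically and all beat arm 3.
P = [
    [0.5, 0.6, 0.4, 0.7],
    [0.4, 0.5, 0.6, 0.7],
    [0.6, 0.4, 0.5, 0.7],
    [0.3, 0.3, 0.3, 0.5],
]
raw, cpld = copeland_scores(P)
winners = [i for i, s in enumerate(raw) if s == max(raw)]
```

In this instance every one of the first three arms is a Copeland winner with score 2, comparing two of them incurs zero regret, and comparing a winner with arm 3 incurs regret 2/3.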


3 Related Work 

Numerous methods have been proposed for the $K$-armed dueling bandit problem, including Interleaved Filter,
Beat the Mean, Relative Confidence Sampling, Relative Upper Confidence Bound (RUCB) [13], Doubler and
MultiSBM [16], and mergeRUCB [17], all of which require the existence of a Condorcet winner, and often come with
bounds of the form $O(K \log T)$. However, as observed in [13] and Appendix C.1, real-world problems do not always
have Condorcet winners.

There is another group of algorithms that do not assume the existence of a Condorcet winner, but have bounds of
the form $O(K^2 \log T)$ in the Copeland setting: Sensitivity Analysis of VAriables for Generic Exploration (SAVAGE)
[10], Preference-Based Racing (PBR) [11] and Rank Elicitation (RankEl) [12]. All three of these algorithms are
designed to solve more general or more difficult problems, and they solve the Copeland dueling bandit problem as a
special case.

This work bridges the gap between these two groups by providing algorithms that are as broadly applicable as the
second group but have regret bounds comparable to those of the first group. Furthermore, in the case of the results for
CCB, rather than depending on the smallest gap between arms $a_i$ and $a_j$, $\Delta_{\min} := \min_{i > j} |p_{ij} - 0.5|$, as in the case
of many results in the Copeland setting,^1 our regret bounds depend on a larger quantity that results in a substantially
lower upper bound, cf. Section 5.1.

In addition to the above, bounds have been proven for other notions of winners, including Borda, Random
Walk, and very recently von Neumann. The dichotomy discussed also persists in the case of these results,
which either rely on restrictive assumptions to obtain a linear dependence on K or are more broadly applicable, at the 
expense of a quadratic dependence on K. A natural question for future work is whether the improvements achieved in 
this paper in the case of the Copeland winner can be obtained in the case of these other notions as well. 

A related setting is that of partial monitoring games [20]. While a dueling bandit problem can be modeled as a
partial monitoring problem, doing so yields weaker results. In [21], the authors present problem-dependent bounds
from which a regret bound of the form $O(K^2 \log T)$ can be deduced for the dueling bandit problem, whereas our work
achieves a linear dependence in K. 

4 Method 

We now present two algorithms that find Copeland winners. 

4.1 Copeland Confidence Bound (CCB) 

CCB (see Algorithm 1) is based on the principle of optimism followed by pessimism: it maintains optimistic and
pessimistic estimates of the preference matrix, i.e., matrices $U$ and $L$ (Line 6). It uses $U$ to choose an optimistic
Copeland winner $a_c$ (Lines 7-9 and 11-12), i.e., an arm that has some chance of being a Copeland winner. Then, it
uses $L$ to choose an opponent $a_d$ (Line 13), i.e., an arm deemed likely to discredit the hypothesis that $a_c$ is indeed a
Copeland winner.

More precisely, an optimistic estimate of the Copeland score of each arm $a_i$ is calculated using $U$ (Line 7), and
$a_c$ is selected from the set of top scorers, with preference given to those in a shortlist, $\mathcal{B}_t$ (Line 11). These are arms
that have, roughly speaking, been optimistic winners throughout history. To maintain $\mathcal{B}_t$, as soon as CCB discovers
that the optimistic Copeland score of an arm is lower than the pessimistic Copeland score of another arm, it purges the
former from $\mathcal{B}_t$ (Line 9B).

The mechanism for choosing the opponent $a_d$ is as follows. The matrices $U$ and $L$ define a confidence interval
around $p_{ij}$ for each $i$ and $j$. In relation to $a_c$, there are three types of arms: (1) arms $a_j$ s.t. the confidence region
of $p_{cj}$ is strictly above 0.5, (2) arms $a_j$ s.t. the confidence region of $p_{cj}$ is strictly below 0.5, and (3) arms $a_j$ s.t. the
confidence region of $p_{cj}$ contains 0.5. Note that an arm of type (1) or (2) at time $t'$ may become an arm of type (3)
at time $t > t'$ even without queries to the corresponding pair, because the confidence intervals widen as time
goes on.

CCB always chooses $a_d$ from arms of type (3) because comparing $a_c$ and a type (3) arm is most informative about
the Copeland score of $a_c$. Among arms of type (3), CCB favors those that have confidently beaten arm $a_c$ in the past
(Line 13), i.e., arms that in some round $t' < t$ were of type (2). Such arms are maintained in a shortlist of “formidable”
opponents ($\mathcal{B}^i_t$) that are likely to confirm that $a_i$ is not a Copeland winner; these arms are favored when selecting $a_d$
(Lines 10 and 13).
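The three opponent types can be expressed as a small classification routine. This is a minimal sketch: the function names and the example interval endpoints are hypothetical, and the shortlist logic of Lines 10 and 13 is deliberately left out.

```python
def opponent_type(l_cj, u_cj):
    """Classify a_j relative to a_c from the confidence interval [l_cj, u_cj] around p_cj."""
    if l_cj > 0.5:
        return 1  # interval strictly above 0.5: a_c confidently beats a_j
    if u_cj < 0.5:
        return 2  # interval strictly below 0.5: a_j confidently beats a_c
    return 3      # interval contains 0.5: the outcome is still undecided


def candidate_opponents(L_row, U_row, c):
    """Indices of type (3) arms, i.e. the arms from which a_d is drawn."""
    return [j for j in range(len(L_row))
            if j != c and opponent_type(L_row[j], U_row[j]) == 3]


# Hypothetical row c = 0 of the matrices L and U for four arms.
L_row = [0.5, 0.62, 0.10, 0.41]
U_row = [0.5, 0.90, 0.38, 0.66]
types = [opponent_type(l, u) for l, u in zip(L_row, U_row)]
candidates = candidate_opponents(L_row, U_row, c=0)
```

Here arm 1 is of type (1), arm 2 of type (2), and only arm 3 remains a candidate opponent.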

The sets $\mathcal{B}^i_t$ are what speed up the elimination of non-Copeland winners, enabling regret bounds that scale asymptotically
with $K$ rather than $K^2$. Specifically, for a non-Copeland winner $a_i$, the set $\mathcal{B}^i_t$ will eventually contain $L_C + 1$

^1 Cf. [10, Equation 9 in §4.1.1] and [11, Theorem 1].
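Algorithm 1 Copeland Confidence Bound

Input: A Copeland dueling bandit problem and an exploration parameter $\alpha > \frac{1}{2}$.

1: $W = [w_{ij}] \leftarrow 0_{K \times K}$ // 2D array of wins: $w_{ij}$ is the number of times $a_i$ beat $a_j$

2: $\mathcal{B}_1 = \{a_1, \ldots, a_K\}$ // potential best arms

3: $\mathcal{B}^i_1 = \emptyset$ for each $i = 1, \ldots, K$ // potential to beat $a_i$

4: $L_C = K$ // estimated max losses of a Copeland winner

5: for $t = 1, 2, \ldots$ do

6: $U := [u_{ij}] = \frac{W}{W + W^T} + \sqrt{\frac{\alpha \ln t}{W + W^T}}$ and $L := [l_{ij}] = \frac{W}{W + W^T} - \sqrt{\frac{\alpha \ln t}{W + W^T}}$, with $u_{ii} = l_{ii} = \frac{1}{2}$ for all $i$

7: $\overline{\mathrm{Cpld}}(a_i) = \#\{k \mid u_{ik} \geq \frac{1}{2}, k \neq i\}$ and $\underline{\mathrm{Cpld}}(a_i) = \#\{k \mid l_{ik} \geq \frac{1}{2}, k \neq i\}$

8: $\mathcal{C}_t = \{a_i \mid \overline{\mathrm{Cpld}}(a_i) = \max_j \overline{\mathrm{Cpld}}(a_j)\}$

9: Set $\mathcal{B}_t \leftarrow \mathcal{B}_{t-1}$ and $\mathcal{B}^i_t \leftarrow \mathcal{B}^i_{t-1}$ and update as follows:

A. Reset disproven hypotheses: If for any $i$ and $a_j \in \mathcal{B}^i_t$ we have $l_{ij} > 0.5$, reset $\mathcal{B}_t$, $L_C$ and $\mathcal{B}^k_t$ for all $k$ (i.e.,
set them to their original values as in Lines 2-4 above).

B. Remove non-Copeland winners: For each $a_i \in \mathcal{B}_t$, if $\overline{\mathrm{Cpld}}(a_i) < \underline{\mathrm{Cpld}}(a_j)$ holds for any $j$, set $\mathcal{B}_t \leftarrow
\mathcal{B}_t \setminus \{a_i\}$, and if $|\mathcal{B}^i_t| \neq L_C + 1$, then set $\mathcal{B}^i_t \leftarrow \{a_k \mid u_{ik} < 0.5\}$. However, if $\mathcal{B}_t = \emptyset$, reset $\mathcal{B}_t$, $L_C$ and $\mathcal{B}^k_t$
for all $k$.

C. Add Copeland winners: For any $a_i \in \mathcal{C}_t$ with $\overline{\mathrm{Cpld}}(a_i) = \underline{\mathrm{Cpld}}(a_i)$, set $\mathcal{B}_t \leftarrow \mathcal{B}_t \cup \{a_i\}$, $\mathcal{B}^i_t \leftarrow \emptyset$ and
$L_C \leftarrow K - 1 - \overline{\mathrm{Cpld}}(a_i)$. For each $j \neq i$, if we have $|\mathcal{B}^j_t| < L_C + 1$, set $\mathcal{B}^j_t \leftarrow \emptyset$, and if $|\mathcal{B}^j_t| > L_C + 1$,
randomly choose $L_C + 1$ elements of $\mathcal{B}^j_t$ and remove the rest.

10: With probability 1/4, sample $(c, d)$ uniformly from the set $\{(i, j) \mid a_j \in \mathcal{B}^i_t \text{ and } 0.5 \in [l_{ij}, u_{ij}]\}$ (if it is non-empty)
and skip to Line 14.

11: If $\mathcal{B}_t \cap \mathcal{C}_t \neq \emptyset$, then with probability 2/3, set $\mathcal{C}_t \leftarrow \mathcal{B}_t \cap \mathcal{C}_t$.

12: Sample $a_c$ from $\mathcal{C}_t$ uniformly at random.

13: With probability 1/2, choose the set $\mathcal{B}^i$ to be either $\mathcal{B}^c_t$ or $\{a_1, \ldots, a_K\}$ and then set
$d \leftarrow \arg\max_{\{j \in \mathcal{B}^i \mid l_{jc} \leq 0.5\}} u_{jc}$. If there is a tie, $d$ is not allowed to be equal to $c$.

14: Compare arms $a_c$ and $a_d$ and increment $w_{cd}$ or $w_{dc}$ depending on which arm wins.

15: end for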

strong opponents for $a_i$ (Line 9C), where $L_C$ is the number of losses of each Copeland winner. Since $L_C$ is typically
small (cf. Appendix C.3), asymptotically this leads to a bound of only $O(\log T)$ on the number of time-steps when $a_i$
is chosen as an optimistic Copeland winner, instead of a bound of $O(K \log T)$, which a more naive algorithm would
produce.
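As an illustration of Lines 6-8 of Algorithm 1, the following minimal Python sketch computes the two confidence matrices and the set of optimistic Copeland winners from a win-count matrix. The function names, the default value of `alpha`, the threshold of at least 1/2 in the optimistic score, and the convention of using the trivial bounds 1 and 0 for pairs that have never been compared are assumptions made for this example. The shortlist updates of Line 9 are not shown.

```python
import math


def confidence_bounds(W, t, alpha=0.51):
    """Return (U, L): optimistic and pessimistic estimates of the preference matrix."""
    K = len(W)
    U = [[0.5] * K for _ in range(K)]  # diagonal entries stay at 1/2
    L = [[0.5] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            n = W[i][j] + W[j][i]  # number of duels between a_i and a_j
            if n == 0:
                U[i][j], L[i][j] = 1.0, 0.0  # assumption: no data gives trivial bounds
                continue
            mean = W[i][j] / n
            radius = math.sqrt(alpha * math.log(t) / n)
            U[i][j], L[i][j] = mean + radius, mean - radius
    return U, L


def optimistic_copeland_winners(U):
    """Return (winners, scores): arms maximizing the optimistic Copeland score."""
    K = len(U)
    scores = [sum(1 for k in range(K) if k != i and U[i][k] >= 0.5) for i in range(K)]
    best = max(scores)
    return [i for i in range(K) if scores[i] == best], scores


# Hypothetical win counts after t = 300 comparisons among three arms.
W = [[0, 60, 10],
     [40, 0, 55],
     [90, 45, 0]]
U, L = confidence_bounds(W, t=300)
winners, scores = optimistic_copeland_winners(U)
```

With these counts, arm 0 has confidently lost to arm 2, so only arms 1 and 2 remain optimistic Copeland winners.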

4.2 Scalable Copeland Bandits (SCB) 

SCB is designed to handle dueling bandit problems with large numbers of arms. It is based on an arm-identification
algorithm, described in Algorithm 2, designed for a PAC setting, i.e., it finds an $\epsilon$-Copeland winner with probability
$1 - \delta$, although we are primarily interested in the case with $\epsilon = 0$. Algorithm 2 relies on a reduction to a $K$-armed
bandit problem where we have direct access to a noisy version of the Copeland score; the process of estimating the
score of arm $a_i$ consists of comparing $a_i$ to a random arm $a_j$ until it becomes clear which arm beats the other. The
sample complexity bound, which yields the regret bound, is achieved by combining a bound for $K$-armed bandits and

Algorithm 2 Approximate Copeland Bandit Solver

Input: A Copeland dueling bandit problem with preference matrix $P = [p_{ij}]$, failure probability $\delta > 0$, and approximation
parameter $\epsilon \geq 0$. Also, define $[K] := \{1, \ldots, K\}$.

1: Define a random variable reward($i$) for $i \in [K]$ as the following procedure: pick a uniformly random $j \neq i$
from $[K]$, query the pair $(a_i, a_j)$ sufficiently many times in order to determine w.p. at least $1 - \delta/K^2$ whether
$p_{ij} > 1/2$; return 1 if $p_{ij} > 0.5$ and 0 otherwise.

2: Invoke the KL-based arm-elimination algorithm (see Appendix I), where in each of its calls to reward($i$), the feedback
is determined by the above stochastic process.

Return: The same output returned by that algorithm.
a bound on the number of arms that can have a high Copeland score.
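Step 1 of Algorithm 2 leaves open how a pair is queried "sufficiently many times". One way to realize it, shown here as a minimal sketch rather than the procedure used in the analysis, is a sequential test that stops once a confidence interval around the empirical win rate excludes 0.5. The function names, the anytime Hoeffding radius, the query budget, and the example preference matrix are all assumptions of this illustration.

```python
import math
import random


def decide_winner(P, i, j, delta, rng, max_queries=100_000):
    """Duel a_i and a_j until a confidence interval around the win rate excludes 0.5.

    Returns (i_beats_j, number_of_queries). With the radius below, a union bound
    over n keeps the probability of a wrong answer below delta.
    """
    wins = 0
    for n in range(1, max_queries + 1):
        wins += rng.random() < P[i][j]  # simulated comparison: True when a_i wins
        mean = wins / n
        radius = math.sqrt(math.log(4 * n * n / delta) / (2 * n))
        if mean - radius > 0.5:
            return True, n
        if mean + radius < 0.5:
            return False, n
    # Budget exhausted: fall back to the empirical mean.
    return wins / max_queries > 0.5, max_queries


def reward(P, i, delta, rng):
    """One draw of reward(i): 1 if a_i is judged to beat a uniformly random opponent."""
    K = len(P)
    j = rng.choice([k for k in range(K) if k != i])
    i_beats_j, _ = decide_winner(P, i, j, delta / K ** 2, rng)
    return 1 if i_beats_j else 0


# Hypothetical instance: arm 3 loses to every other arm with probability 0.7.
P = [
    [0.5, 0.6, 0.4, 0.7],
    [0.4, 0.5, 0.6, 0.7],
    [0.6, 0.4, 0.5, 0.7],
    [0.3, 0.3, 0.3, 0.5],
]
rng = random.Random(0)
draws = [reward(P, 3, 0.001, rng) for _ in range(20)]
```

Averaging many draws of `reward(P, i, ...)` estimates the normalized Copeland score of arm `i`, which is the quantity the subroutine in Step 2 treats as a bandit reward.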

Algorithm 2 calls a $K$-armed bandit algorithm as a subroutine. To this end, we use the KL-based arm-elimination
algorithm (a slight modification of Algorithm 2 in [22]) described in Appendix I. It implements an
elimination tournament with confidence regions based on the KL-divergence between probability distributions.

Combining this with the squaring trick, a modification of the doubling trick that reduces the number of partitions
from $\log T$ to $\log \log T$, the SCB algorithm, described in Algorithm 3, repeatedly calls Algorithm 2 but force-terminates
if an increasing threshold is reached. If it terminates early, then the identified arm is played against itself
until the threshold is reached.

Algorithm 3 Scalable Copeland Bandits

Input: A Copeland dueling bandit problem with preference matrix $P = [p_{ij}]$

1: for all $r = 1, 2, \ldots$ do

2: Set $T = 2^{2^r}$ and run Algorithm 2 with failure probability $\log(T)/T$ in order to find an exact Copeland winner
($\epsilon = 0$); force-terminate if it requires more than $T$ queries.

3: Let $T_0$ be the number of queries used by invoking Algorithm 2, and let $a_1$ be the arm produced by it; query the
pair $(a_1, a_1)$ $T - T_0$ times.

4: end for
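The effect of the squaring trick on the number of rounds can be seen with a short Python sketch. It only models the round budgets of Algorithm 3, where each round consumes exactly $T$ queries, and compares them with a plain doubling schedule; the function names and the horizon of $10^6$ are illustrative assumptions.

```python
def squaring_schedule(horizon):
    """Yield (r, T_r) with T_r = 2**(2**r) until the cumulative budget covers `horizon`."""
    used, r = 0, 1
    while used < horizon:
        T = 2 ** (2 ** r)
        yield r, T
        used += T
        r += 1


def doubling_schedule(horizon):
    """Yield (r, T_r) with T_r = 2**r until the cumulative budget covers `horizon`."""
    used, r = 0, 1
    while used < horizon:
        T = 2 ** r
        yield r, T
        used += T
        r += 1


horizon = 10 ** 6
n_squaring = len(list(squaring_schedule(horizon)))
n_doubling = len(list(doubling_schedule(horizon)))
```

For this horizon the squaring schedule needs 5 rounds (budgets 4, 16, 256, 65536, and $2^{32}$), whereas the doubling schedule needs 19, which reflects the reduction from $\log T$ to $\log \log T$ partitions.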


5 Theoretical Results 

In this section, we present regret bounds for both CCB and SCB. Assuming that the number of Copeland winners and
the number of losses of each Copeland winner are bounded,^2 CCB’s regret bound takes the form $O(K^2 + K \log T)$,
while SCB’s is of the form $O(K \log K \log T)$. Note that these bounds are not directly comparable. When there are
relatively few arms, CCB is expected to perform better. By contrast, when there are many arms, SCB is expected to be
superior. Appendix A provides empirical evidence to support these expectations.

Throughout this section we impose the following condition on the preference matrix:

A  There are no ties, i.e., for all pairs $(a_i, a_j)$ with $i \neq j$, we have $p_{ij} \neq 0.5$.

This assumption is not very restrictive in practice. For example, in the ranker evaluation setting from information 
retrieval, each arm corresponds to a ranker, a complex and highly engineered system, so it is unlikely that two rankers 
are indistinguishable. Furthermore, some of the results we present in this section actually hold under even weaker 
assumptions. However, for the sake of clarity, we defer a discussion of these nuanced differences to the appendix.


5.1 Copeland Confidence Bounds (CCB) 

To analyze Algorithm 1, consider a $K$-armed Copeland bandit problem with arms $a_1, \ldots, a_K$ and preference matrix
$P = [p_{ij}]$, such that arms $a_1, \ldots, a_C$ are the Copeland winners, with $C$ being the number of Copeland winners.
Throughout this section, we assume that the parameter $\alpha$ in Algorithm 1 satisfies $\alpha > 0.5$, unless otherwise stated. We
first define the relevant quantities:


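Definition 3. Given the above setting we define:^3

1. $\mathcal{L}_i := \{a_j \mid p_{ij} < 0.5\}$, i.e., the arms to which $a_i$ loses, and $L_C := |\mathcal{L}_1|$.

2. $\Delta_{ij} := |p_{ij} - 0.5|$ and $\Delta_{\min} := \min_{i \neq j} \Delta_{ij}$.

3. Given $i > C$, define $i^*$ as the index of the $(L_C + 1)^{\text{th}}$ largest element in the set $\{\Delta_{ij} \mid p_{ij} < 0.5\}$.

4. Define $\Delta^*_i$ to be $\Delta_{i i^*}$ if $i > C$ and 0 otherwise. Moreover, let us set $\Delta^*_{\min} := \min_{i > C} \Delta^*_i$.

5. Define $\Delta^*_{ij}$ to be $\Delta^*_i + \Delta_{ij}$ if $p_{ij} > 0.5$ and $\max\{\Delta^*_i, \Delta_{ij}\}$ otherwise.^4

6. $\Delta := \min\{\min_{i \leq C < j} \Delta_{ij}, \Delta^*_{\min}\}$, where $\Delta^*_{\min}$ is defined as in item 4 above.

7. $C(\delta) := \left((4\alpha - 1)K^2 / ((2\alpha - 1)\delta)\right)^{\frac{1}{2\alpha - 1}}$, where $\alpha$ is as in Algorithm 1.

8. $N^\delta_{ij}(t)$ is the number of time-steps between times $C(\delta)$ and $t$ when $a_i$ was chosen as the optimistic Copeland winner
and $a_j$ as the challenger. Also, $\hat{N}^\delta_{ij}(t)$ is defined to be $(4\alpha \ln t)/(\Delta^*_{ij})^2$ if $i \neq j$, 0 if $i = j > C$ and $t$ if $i = j \leq C$.
We also define $\hat{N}^\delta(t) := \sum_{i \neq j} \hat{N}^\delta_{ij}(t) + 1$.

^2 See Appendix C.3 for experimental evidence that this is the case in practice.
^3 See the tables summarizing the definitions used in this paper.
^4 See the accompanying figure for a pictorial explanation.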
Using this notation, our expected regret bound for CCB takes the form:

$O\left(\frac{K^2 + (C + L_C) K \ln T}{\Delta^2}\right)$. (1)

This result is proven in two steps. First, Proposition 4 bounds the number of comparisons involving non-Copeland
winners, yielding a result of the form $O(K^2 \ln T)$. Second, Theorem 11 closes the gap between this bound and that
of (1) by showing that, beyond a certain time horizon, CCB selects non-Copeland winning arms as the optimistic
Copeland winner very infrequently.

Note that we have $\Delta^*_{ij} \geq \Delta_{ij}$ for all pairs $i \neq j$. Thus, for simplicity, the analysis in this section can be read as if
the bounds were given in terms of $\Delta_{ij}$. We use $\Delta^*_{ij}$ instead because it gives tighter upper bounds. In particular, simply
using the gaps $\Delta_{ij}$ would replace the denominator of the expression in (1) with $\Delta^2_{\min}$, which leads to a substantially
worse regret bound in practice. For instance, in the ranker evaluation application used in the experiments, this change
would on average increase the regret bound by a factor that is of the order of tens of thousands. See Appendix C.4 for
a more quantitative discussion of this point.

We can now state our first bound, which is proved in the appendix under weaker assumptions.

Proposition 4. Given any $\delta > 0$ and $\alpha > 0.5$, if we apply CCB (Algorithm 1) to a dueling bandit problem satisfying
Assumption A, the following holds with probability $1 - \delta$: for any $T > C(\delta)$ and any pair of arms $a_i$ and $a_j$, we have
$N^\delta_{ij}(T) \leq \hat{N}^\delta_{ij}(T)$.

One can sum the inequalities in the last proposition over pairs $(i, j)$ to get a regret bound of the form $O(K^2 \log T)$
for Algorithm 1. However, as Theorem 11 will show, we can use the properties of the sets $\mathcal{B}^i_t$ to obtain a tighter regret
bound of the form $O(K \log T)$. Before stating that theorem, we need a few definitions and lemmas. We begin by
defining the key quantity:

Definition 5. Given a preference matrix $P$ and $\delta > 0$, $T_\delta$ is the smallest integer satisfying
Ts > C{^)+8K^{Lc + l)^\n^+K^ln^ + ^^^^^^^^^^\nTs + Ni{Ts)+4KmaxNf{Ts). 


i>C 


Remark 6. $T_\delta$ is $\mathrm{poly}(K, \delta^{-1})$ and our regret bound below scales as $\log T_\delta$.


The following two lemmas are key to the proof of Theorem 11. Lemma 7 (proved in Appendix F) states that,
with high probability by time $T_\delta$, each set $\mathcal{B}^i_t$ contains $L_C + 1$ arms $a_j$, each of which beats $a_i$ (i.e., $p_{ij} < 0.5$).
This fact then allows us to prove Lemma 8 (Appendix G), which states that, after time-step $T_\delta$, the rate of suboptimal
comparisons is $O(K \ln T)$ rather than $O(K^2 \ln T)$.

Lemma 7. Given $\delta > 0$, with probability $1 - \delta$, each set $\mathcal{B}^i_{T_\delta}$ with $i > C$ contains exactly $L_C + 1$ elements with each
element $a_j$ satisfying $p_{ij} < 0.5$. Moreover, for all $t \in [T_\delta, T]$, we have $\mathcal{B}^i_t = \mathcal{B}^i_{T_\delta}$.

Lemma 8. Given a Copeland bandit problem satisfying Assumption A and any $\delta > 0$, with probability $1 - \delta$ the
following holds: the number of time-steps between $T_{\delta/2}$ and $T$ when each non-Copeland winner $a_i$ can be chosen
as optimistic Copeland winners (i.e., times when arm in Algorithm^ satisfies c > C) is bounded by A® ;= 


2A^ -f 2) 


,ln^, where ■.= Y, 




Nf/^(T). 


Remark 9. Due to Lemma 7, with high probability each set $\mathcal{B}^i_{T_\delta}$ with $i > C$ contains exactly $L_C + 1$ arms, and so the total
number of times between $T_\delta$ and $T$ when a non-Copeland winner is chosen as an optimistic Copeland winner is in
$O(K L_C \ln T)$ for a fixed minimal gap $\Delta^*_{\min}$. The only other way a suboptimal comparison can occur is if a Copeland
winner is compared against a non-Copeland winner, and according to Proposition 4, the number of such occurrences
is bounded by $O(KC \ln T)$. Hence, the number of suboptimal comparisons is in $O(K \ln T)$ assuming that $C$ and $L_C$
are bounded. In Appendix C.3, we provide experimental evidence for this.


We now define the quantities needed to state the main theorem.


Definition 10. We define the following three quantities: := C{5/A) -\- N^(Ts/ 2 ), A^p := ^ 


yTF+T IK 

^i>C A* S 


and ;= 


<C<] (Ay)2 


2 E. 


Lc + 1 

>c 
Theorem 11. Given a Copeland bandit problem satisfying Assumption A and any $\delta > 0$ and $\alpha > 0.5$, with probability
$1 - \delta$, the regret accumulated by CCB is bounded by the following:

$A^{(1)} + A^{(2)} \sqrt{\ln T} + A^{(3)} \ln T$.

For a general assessment of the above quantities, assuming that $L_C$ and $C$ are both $O(1)$, the above quantities in
terms of $K$ become $A^{(1)} = O(K^2)$, $A^{(2)} = O(K \log(K))$, $A^{(3)} = O(K)$. Hence, the above bound boils down to the
expression in (1). We now turn to the proof of the theorem.

Proof of Theorem 11. Let us consider the two disjoint time-intervals $[1, T_{\delta/2}]$ and $(T_{\delta/2}, T]$:

$[1, T_{\delta/2}]$: In this case, applying Proposition 4 to $T_{\delta/2}$, we get that the number of time-steps when a non-Copeland
winner was compared against another arm is bounded by $A^{(1)}$. As the maximum regret such a comparison can incur
is 1, this deals with the first term in the above expression.

$(T_{\delta/2}, T]$: In this case, applying Lemma 8, we get the other two terms in the above regret bound. □

Now that we have the high probability regret bound given in Theorem 11, we can deduce the expected regret result
claimed in (1) for $\alpha > 1$, as a corollary by integrating $\delta$ over the interval $[0, 1]$.


5.2 Scalable Copeland Bandits 

We now turn to our regret result for SCB, which lowers the $K^2$ dependence in the additive constant of CCB’s regret
result to $K \log K$. We begin by defining the relevant quantities:


Definition 12. Given a $K$-armed Copeland bandit problem and an arm $a_i$, we define the following:

1. Recall that $\mathrm{cpld}(a_i) := \mathrm{Cpld}(a_i)/(K - 1)$ is called the normalized Copeland score.

2. $a_i$ is an $\epsilon$-Copeland-winner if $1 - \mathrm{cpld}(a_i) \leq (1 - \mathrm{cpld}(a_1))(1 + \epsilon)$.

3. $\Delta_i := \max\{\mathrm{cpld}(a_1) - \mathrm{cpld}(a_i), 1/(K - 1)\}$ and $H_i := \sum_{j \neq i} \frac{1}{\Delta_{ij}^2}$, with $H_\infty := \max_i H_i$.

4. $\Delta^\epsilon_i = \max\{\Delta_i, \epsilon(1 - \mathrm{cpld}(a_1))\}$.


We now state our main scalability result: 

Theorem 13. Given a Copeland bandit problem satisfying Assumption A, the expected regret of SCB (Algorithm^ is 


■yK (1—cpld(ai)) 


K A2 


log(r), which in turn can be bounded by O y 


( K(Lc+\osK)\o^T 

A 


where Lc 


bounded by O 

and Amin cire as in Definition^ 

Recall that SCB is based on Algorithm 2, an arm-identification algorithm that identifies a Copeland winner with
high probability. As a result, Theorem 13 is an immediate corollary of Lemma 14, obtained by using the well-known
squaring trick. As mentioned in Section 4.2, the squaring trick is a minor variation on the doubling trick that reduces
the number of partitions from $\log T$ to $\log \log T$.

Lemma 14 is a result for finding an $\epsilon$-approximate Copeland winner (see Definition 12). Note that, for the regret
setting, we are only interested in the special case with $\epsilon = 0$, i.e., the problem of identifying the best arm.


Lemma 14. Wi£i 
O 


jjrof^abUi^ 1 — 5 , 


K 


m 


V 


ithm^^nds an e-approximate Copeland winner by time 
log(l/5) < O (TLoo (log(Ar) -f min {e~^, Lc})) log(l/5). 


In particular, when there is a Condorcet winner ($\mathrm{cpld}(a_1) = 1$, $L_C = 0$) or more
generally $\mathrm{cpld}(a_1) = 1 - O(1/K)$, $L_C = O(1)$, an exact solution is found with probability at least $1 - \delta$ by using an
expected number of queries of at most $O(H_\infty (L_C + \log K)) \log(1/\delta)$.


In the remainder of this section, we sketch the main ideas underlying the proof of Lemma 14, detailed in Appendix
H. We first treat the simpler deterministic setting in which a single query suffices to determine which of a pair of arms
beats the other. While a solution can easily be obtained using $K(K - 1)/2$ many queries, we aim for one with query
complexity linear in $K$. The main ingredients of the proof are as follows:

1. $\mathrm{cpld}(a_i)$ is the mean of a Bernoulli random variable defined as such: sample uniformly at random an index $j$ from
the set $\{1, \ldots, K\} \setminus \{i\}$ and return 1 if $a_i$ beats $a_j$ and 0 otherwise.

2. Applying a KL-divergence based arm-elimination algorithm (see Appendix I) to the $K$-armed bandit arising from the
above observation, we obtain a bound by dividing the arms into two groups: those with Copeland scores close to
that of the Copeland winners, and the rest. For the former, we use the result from Lemma 15 to bound the number of
such arms; for the latter, the resulting regret is dealt with using Lemma 16, which exploits the possible distribution
of Copeland scores.

^5 The exact expression requires replacing $\log(1/\delta)$ with $\log(K H_\infty/\delta)$.

Let us state the two key lemmas here: 

Lemma 15. Let $D \subset \{a_1, \ldots, a_K\}$ be the set of arms for which $\mathrm{cpld}(a_i) \geq 1 - d/(K - 1)$, that is, arms that are
beaten by at most $d$ arms. Then $|D| \leq 2d + 1$.

Proof. Consider a fully connected directed graph, whose node set is $D$ and in which the arc $(a_i, a_j)$ is present if arm $a_i$
beats arm $a_j$. By the definition of cpld, the in-degree of any node $i$ is upper bounded by $d$. Therefore, the total number
of arcs in the graph is at most $|D| d$. Now, the full connectivity of the graph implies that the total number of arcs in the
graph is exactly $|D|(|D| - 1)/2$. Thus, $|D|(|D| - 1)/2 \leq |D| d$ and the claim follows. □
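The counting argument in this proof yields $|D| \leq 2d + 1$, and that bound can be sanity-checked numerically. The sketch below draws random tie-free tournaments and counts the arms with at most $d$ losses; the function names, the tournament size, and the seed are arbitrary choices made for this illustration.

```python
import random


def random_tournament(K, rng):
    """Return beats[i][j] = True when a_i beats a_j, with no ties."""
    beats = [[False] * K for _ in range(K)]
    for i in range(K):
        for j in range(i + 1, K):
            if rng.random() < 0.5:
                beats[i][j] = True
            else:
                beats[j][i] = True
    return beats


def arms_with_at_most_d_losses(beats, d):
    """The set D of Lemma 15: arms that are beaten by at most d other arms."""
    K = len(beats)
    losses = [sum(1 for j in range(K) if beats[j][i]) for i in range(K)]
    return [i for i in range(K) if losses[i] <= d]


rng = random.Random(0)
violations = 0
for _ in range(200):
    beats = random_tournament(12, rng)
    for d in range(12):
        if len(arms_with_at_most_d_losses(beats, d)) > 2 * d + 1:
            violations += 1
```

Because the bound holds for every tournament, `violations` stays at zero regardless of the seed.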

Lemma 16. The sum $\sum_{\{i \mid \mathrm{cpld}(a_i) < 1\}} \frac{1}{1 - \mathrm{cpld}(a_i)}$ is in $O(K \log K)$.

Proof. Follows from Lemma 15 via a careful partitioning of arms. Details are in Appendix H. □

Given the structure of Algorithm 2, the stochastic case is similar to the deterministic case for the following reason:
while the latter requires a single comparison between arms $a_i$ and $a_j$ to determine which arm beats the other, in the
stochastic case, we need roughly $\frac{\log(K \log(\Delta_{ij}^{-1})/\delta)}{\Delta_{ij}^2}$ comparisons between the two arms to correctly answer the same
question with probability at least $1 - \delta/K^2$.

6 Conclusion 

In many applications that involve learning from human behavior, feedback is more reliable when provided in the form 
of pairwise preferences. In the dueling bandit problem, the goal is to use such pairwise feedback to find the most
desirable choice from a set of options. Most existing work in this area assumes the existence of a Condorcet winner, 
i.e., an arm that beats all other arms with probability greater than 0.5. Even though these results have the advantage 
that the bounds they provide scale linearly in the number of arms, their main drawback is that in practice the Condorcet 
assumption is too restrictive. By contrast, other results that do not impose the Condorcet assumption achieve bounds 
that scale quadratically in the number of arms. 

In this paper, we set out to solve a natural generalization of the problem, where instead of assuming the existence 
of a Condorcet winner, we seek to find a Copeland winner, which is guaranteed to exist. We proposed two algorithms
to address this problem: one for small numbers of arms, called CCB; and a more scalable one, called SCB, that works 
better for problems with large numbers of arms. We provided theoretical results bounding the regret accumulated by 
each algorithm: these results improve substantially over existing results in the literature, by filling the gap that exists
in the current results, namely the discrepancy between results that make the Condorcet assumption and are of the form
$O(K \log T)$ and the more general results that are of the form $O(K^2 \log T)$.

Moreover, we have included empirical results on both a dueling bandit problem arising from a real-life application 
domain and a large-scale synthetic problem used to test the scalability of SCB. The results of these experiments 
show that CCB beats all existing Copeland dueling bandit algorithms, while SCB outperforms CCB on the large-scale 
problem. 

One open question raised by our work is how to devise an algorithm that has the benefits of both CCB and SCB, 
i.e., the scalability of the latter together with the former’s better dependence on the gaps. At this point, it is not clear 
to us how this could be achieved. 

Another interesting direction for future work is an extension of both CCB and SCB to problems with a continuous 
set of arms. Given the prevalence of cyclical preference relationships in practice, we hypothesize that the non-existence 
of a Condorcet winner is an even greater issue when dealing with an infinite number of arms. Given that both our 
algorithms utilize confidence bounds to make their choices, we anticipate that continuous-armed UCB-style algorithms 
like those proposed in Il23l429l can be combined with our ideas to produce a solution to the continuous-armed Copeland 
bandit problem that does not rely on the convexity assumptions made by algorithms such as the one proposed in 1 ^ . 

Finally, it is also interesting to expand our results to handle scores other than the Copeland score, such as an 
ε-insensitive variant of the Copeland score (as in ifT^ '), or completely different notions of winners, such as the Borda, 
the Random Walk or the von Neumann winners (see, e.g., caEi]). 

Appendix 


A Experimental Results 

To evaluate our methods CCB and SCB, we apply them to three Copeland dueling bandit problems. The first is a 
5-armed problem arising from ranker evaluation in the field of information retrieval (IR) [32]. The second is a 500- 
armed synthetic example created to test the scalability of SCB. The third is an example with a Condorcet winner which 
shows how CCB compares against RUCB when the condition required by RUCB is satisfied. 

All three experiments follow the experimental approach in iDiia and use the given preference matrix to simulate 
comparisons between each pair of arms $(a_i, a_j)$ by drawing samples from Bernoulli random variables with mean $p_{ij}$. 
We compare our two proposed algorithms against the state-of-the-art K-armed dueling bandit algorithm, RUCB [13], 
and Copeland SAVAGE, PBR and RankEl. We include RUCB in order to verify our claim that K-armed dueling 
bandit algorithms that assume the existence of a Condorcet winner have linear regret if applied to a Copeland dueling 
bandit problem without a Condorcet winner. Note that in all our plots, the horizontal time axes use a log scale, while 
the vertical axes, which measure cumulative regret, use a linear scale. 
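
The simulation step described above can be sketched as follows. This is only an illustrative sketch, not the authors' experimental code: the function names, the 3-armed matrix `PREF` and the sample size are assumptions made for the example.

```python
import random

def simulate_duel(pref, i, j, rng):
    """Return True if arm i beats arm j in one simulated comparison.

    pref[i][j] is the probability p_ij that arm i beats arm j, so the
    outcome is a single Bernoulli draw with mean p_ij.
    """
    return rng.random() < pref[i][j]

def estimate_preference(pref, i, j, n, rng):
    """Empirical win rate of arm i over arm j across n simulated duels."""
    wins = sum(simulate_duel(pref, i, j, rng) for _ in range(n))
    return wins / n

# Hypothetical cyclic 3-armed preference matrix (no Condorcet winner).
PREF = [
    [0.5, 0.7, 0.2],
    [0.3, 0.5, 0.8],
    [0.8, 0.2, 0.5],
]

rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = estimate_preference(PREF, 0, 1, 20000, rng)
```

With enough simulated duels the empirical win rate concentrates around the corresponding entry of the preference matrix, which is what lets a fixed matrix stand in for live comparisons.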

The first experiment uses a 5-armed problem arising from ranker evaluation in the field of information retrieval 
(IR) [32], detailed in Appendix B. Figure 1 shows the regret accumulated by CCB, SCB, the Copeland variants of 
SAVAGE, PBR and RankEl, as well as RUCB on this problem. CCB outperforms all other algorithms in this 5-armed 
experiment. 

Figure 1: Small-scale regret results for a 5-armed Copeland dueling bandit problem arising from ranker evaluation. 


Note that three of the baseline algorithms under consideration here (i.e., SAVAGE, PBR and RankEl) require 
the horizon of the experiment as an input. Therefore, we ran independent experiments with varying horizons and 
recorded the accumulated regret; the markers on the curves corresponding to these algorithms represent these numbers. 
Consequently, the regret curves are not monotonically increasing. For instance, SAVAGE’s cumulative regret at time 
2 X 10^ is lower than at time 10^ because the runs that produced the former number were not continuations of those 
that resulted in the latter, but rather completely independent. Furthermore, RUCB’s cumulative regret grows linearly, 
which is why the plot does not contain the entire curve. 

The second experiment uses a 500-armed synthetic example created to test the scalability of SCB. In particular, 
we fix a preference matrix in which the three Copeland winners are in a cycle, each with a Copeland score of 498, and 
the other arms have Copeland scores ranging from 0 to 496. 
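
One way to build a matrix with exactly these Copeland scores is sketched below. The construction (a 3-cycle on top of a totally ordered remainder) and the margin of 0.1 are assumptions chosen for illustration; the text does not specify how the actual synthetic matrix was generated.

```python
def build_synthetic_matrix(k=500, margin=0.1):
    """Preference matrix whose three Copeland winners (arms 0, 1, 2) form a cycle.

    Assumed construction: every win has probability 0.5 + margin.
    """
    hi, lo = 0.5 + margin, 0.5 - margin
    pref = [[0.5] * k for _ in range(k)]

    def set_win(winner, loser):
        pref[winner][loser] = hi
        pref[loser][winner] = lo

    # Cycle among the top three arms: 0 beats 1, 1 beats 2, 2 beats 0.
    set_win(0, 1)
    set_win(1, 2)
    set_win(2, 0)
    # Each of the top three arms beats every remaining arm.
    for top in range(3):
        for other in range(3, k):
            set_win(top, other)
    # The remaining arms are totally ordered: lower index beats higher index.
    for a in range(3, k):
        for b in range(a + 1, k):
            set_win(a, b)
    return pref

def copeland_scores(pref):
    """Number of other arms that each arm beats with probability above 0.5."""
    k = len(pref)
    return [sum(1 for j in range(k) if j != i and pref[i][j] > 0.5)
            for i in range(k)]

scores = copeland_scores(build_synthetic_matrix())
```

Each of the three top arms beats the 497 remaining arms plus one arm of the cycle, giving a score of 498, while the ordered remainder yields every score from 496 down to 0.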

Figure 2, which depicts the results of this experiment, shows that when there are many arms, SCB can substantially 
outperform CCB. We omit SAVAGE, PBR and RankEl from this experiment because they scale poorly in the number 
of arms inoHni. 

The reason for the sharp transition in the regret curves of CCB and SCB in the synthetic experiment is as follows. 
Because there are many arms, as long as one of the two arms being compared is not a Copeland winner, the comparison 
can result in substantial regret; since both algorithms choose the second arm in each round based on some criterion 
other than the Copeland score, even if the first chosen arm in a given time-step is a Copeland winner, the incurred 
regret may be as high as 0.5. The sudden transition in Figure 2 occurs when the algorithm becomes confident enough 
of its choice for the first arm to begin comparing it against itself, at which point it stops accumulating regret. 
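
To make the "as high as 0.5" figure concrete, the sketch below computes a per-step regret under the assumption that regret is the normalized Copeland score of a Copeland winner minus the average normalized Copeland score of the two compared arms. That definition is not restated in this appendix, so treat it, the function names and the 4-armed matrix as assumptions.

```python
def copeland_scores(pref):
    """Number of other arms that each arm beats with probability above 0.5."""
    k = len(pref)
    return [sum(1 for j in range(k) if j != i and pref[i][j] > 0.5)
            for i in range(k)]

def copeland_regret(pref, i, j):
    """Per-step regret of comparing arms i and j (assumed definition).

    Scores are normalized by K - 1 so that they lie in [0, 1].
    """
    k = len(pref)
    norm = [s / (k - 1) for s in copeland_scores(pref)]
    return max(norm) - 0.5 * (norm[i] + norm[j])

# Hypothetical 4-armed example without a Condorcet winner:
# arms 0 and 1 are Copeland winners with score 2, arms 2 and 3 have score 1.
PREF = [
    [0.5, 0.6, 0.6, 0.4],
    [0.4, 0.5, 0.6, 0.6],
    [0.4, 0.4, 0.5, 0.6],
    [0.6, 0.4, 0.4, 0.5],
]

regret_winner_vs_itself = copeland_regret(PREF, 0, 0)
regret_winner_vs_loser = copeland_regret(PREF, 0, 2)
```

Under this definition, pairing a Copeland winner with itself costs nothing, whereas pairing it with an arm of normalized score close to 0 costs close to 0.5, which matches the behavior described above.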

The third experiment is an example with a Condorcet winner designed to show how CCB compares against RUCB 
when the condition required by RUCB is satisfied. The regret plots for SAVAGE and SCB were excluded here since 
they both perform substantially worse than either RUCB or CCB, as expected. This example was extracted in the same 
fashion as the example used in the ranker evaluation experiment detailed in Appendix B, with the sole difference that 
this time we ensured that one of the rankers is a Condorcet winner. The results, depicted in Figure 3, show that CCB 
enjoys a slight advantage over RUCB in this case. We attribute this to the careful process of identifying and utilizing 
the weaknesses of non-Copeland winners, as carried out by lines 12 and 18 of Algorithm 1. 

B Ranker Evaluation Details 


A ranker is a function that takes as input a user’s search query and ranks the documents in a collection according to 
their relevance to that query. Ranker evaluation aims to determine which among a set of rankers performs best. One 
effective way to achieve this is to use interleaved comparisons [33], which interleave the ranked lists of documents 
proposed by two rankers and present the resulting list to the user, whose subsequent click feedback is used to infer a 
noisy preference for one of the rankers. Given a set of K rankers, the problem of finding the best ranker can then be 
modeled as a K-armed dueling bandit problem, with each arm corresponding to a ranker. 

We use interleaved comparisons to estimate the preference matrix for the full set of rankers included with the 
MSLR dataset,^ from which we select 5 rankers such that a Condorcet winner does not exist. The MSLR dataset 
consists of relevance judgments provided by expert annotators assessing the relevance of a given document to a given 
query. Using this data set, we create a set of 136 rankers, each corresponding to a ranking feature provided in the 
data set, e.g., PageRank. The ranker evaluation task in this context corresponds to determining which single feature 
constitutes the best ranker m. 

To compare a pair of rankers, we use probabilistic interleave (PI) [34], a recently developed method for interleaved 
comparisons. To model the user’s click behavior on the resulting interleaved lists, we employ a probabilistic user 
model [34, 35] that uses as input the manual labels (classifying documents as relevant or not for given queries) provided 
with the MSLR dataset. Queries are sampled randomly and clicks are generated probabilistically by conditioning on 
these assessments in a way that resembles the behavior of an actual user [36]. Specifically, we employ an informational 
click model in our ranker evaluation experiments [37]. 

The informational click model simulates the behavior of users whose goal is to acquire knowledge about multiple 
facets of a topic, rather than seeking a specific page that contains all the information that they need. As such, in the 
informational click model, the user tends to continue examining documents even after encountering a highly relevant 
document. The informational click model is one of the three click models utilized in the ranker evaluation literature, 
along with the perfect and navigational click models [37]. It turns out that the full preference matrix of the feature 
vectors of the MSLR dataset has a Condorcet winner when the perfect or the navigational click models are used. As 
we will see in Appendix C.1, this is no longer true when the informational click model is used. 



Figure 2: Large-scale regret results for a synthetic 500-armed Copeland dueling bandit problem. 


^ http://research.microsoft.com/en-us/projects/mslr/default.aspx 

Figure 3: Regret results for a Condorcet example (plot title: MSLR Condorcet Example with 5 Rankers). 

Following ElEl, we first use the above approach to estimate the comparison probabilities $p_{ij}$ for each pair of 
rankers and then use these probabilities to simulate comparisons between rankers. More specifically, we estimate the 
full preference matrix, called the informational preference matrix, by performing 400,000 interleaved comparisons on 
each pair of the 136 feature rankers. 

C Assumptions and Key Quantities 

In this section, we provide quantitative analysis of the various assumptions, definitions and quantities that were 
discussed in the main body of the paper. 

C.1 The Condorcet Assumption 

To test how stringent the Condorcet assumption is, we use the informational preference matrix described in Section 
B to estimate for each $K = 1, \ldots, 136$ the probability $P_K$ that a given K-armed dueling bandit problem, obtained 
from considering K of our 136 feature rankers, would have a Condorcet winner, by randomly selecting 10,000 
K-armed dueling bandit problems and counting the ones with Condorcet winners. As can be seen from Figure 4, as K 
grows the probability that the Condorcet assumption holds decreases rapidly. We hypothesize that this is because the 
informational click model explores more of the list of ranked documents than the navigational click model, which was 
used in US, and so it is more likely to encounter non-transitivity phenomena of the sort described in [38]. 
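
The sampling procedure described in this subsection can be sketched as below. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the 4-armed matrix stands in for the 136-ranker informational preference matrix, which is not reproduced here.

```python
import random

def has_condorcet_winner(pref, arms):
    """True if some arm in `arms` beats every other arm in `arms` with p > 0.5."""
    return any(all(pref[i][j] > 0.5 for j in arms if j != i) for i in arms)

def condorcet_probability(pref, k, samples, rng):
    """Monte Carlo estimate of the chance that a random k-arm subset has a Condorcet winner."""
    n = len(pref)
    hits = 0
    for _ in range(samples):
        arms = rng.sample(range(n), k)  # k distinct arms, chosen uniformly
        hits += has_condorcet_winner(pref, arms)
    return hits / samples

# Hypothetical 4-armed matrix with no ties and no overall Condorcet winner.
PREF = [
    [0.5, 0.6, 0.6, 0.4],
    [0.4, 0.5, 0.6, 0.6],
    [0.4, 0.4, 0.5, 0.6],
    [0.6, 0.4, 0.4, 0.5],
]

rng = random.Random(0)
probability_by_k = {k: condorcet_probability(PREF, k, 1000, rng) for k in (1, 2, 4)}
```

Subsets of size 1 or 2 always have a Condorcet winner when there are no ties, so any decline can only appear for larger subsets, as in the toy matrix where the full 4-arm problem has none.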

C.2 Other Notions of Winners 

As mentioned earlier in the paper, numerous other definitions of what constitutes the best arm have been proposed, some of 
which specialize to the Condorcet winner, when it exists. This latter property is desirable both in preference learning 
and social choice theory; the Condorcet winner is the choice that is preferred over all other choices, so if it exists, there 
is good reason to insist on selecting it. The Copeland winner, as discussed in this paper, and the von Neumann winner 
m satisfy this property, while the Borda (a.k.a. Sum of Expectations) and the Random Walk (a.k.a. PageRank) 
winners ll^ do not. The von Neumann winner is in fact defined as a distribution over arms such that playing it 
will maximize the probability to beat any fixed arm. The Borda winner is defined as the arm maximizing the score 
$\sum_{j \ne i} p_{ij}$ and so can be interpreted as the arm that beats other arms by the most, rather than beating the most arms. 
The Random Walk winner is defined as the arm we are most likely to visit in some Markov Chain determined by the 
preference matrix. In this section, we provide some numerical evidence for the similarity of these notions in practice, 
based on the sampled preference matrices obtained from the ranker evaluation from IR, which was described in the 
last section. Table 1 lists the percentage of preference matrices for which pairs of winners overlapped. In the case of 
the von Neumann winner, which is defined as a probability distribution over the set of arms im, we used the support 
of the distribution (i.e., the set of arms with non-zero probability) to define overlap with the other definitions. 
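
The difference between "beating the most arms" and "beating other arms by the most" can be seen in a small sketch. Only the Copeland and Borda winners are computed here; the von Neumann and Random Walk winners need a linear program and a stationary distribution respectively and are left out. The function names and the 3-armed matrix are assumptions made for the example.

```python
def copeland_winners(pref):
    """Arms that beat the largest number of other arms (p_ij > 0.5)."""
    k = len(pref)
    scores = [sum(1 for j in range(k) if j != i and pref[i][j] > 0.5)
              for i in range(k)]
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]

def borda_winners(pref):
    """Arms maximizing the Borda score, i.e. the sum of p_ij over j != i."""
    k = len(pref)
    scores = [sum(pref[i][j] for j in range(k) if j != i) for i in range(k)]
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]

# Hypothetical matrix: arm 0 beats both other arms narrowly,
# while arm 1 loses narrowly to arm 0 but beats arm 2 by a wide margin.
PREF = [
    [0.50, 0.51, 0.51],
    [0.49, 0.50, 0.99],
    [0.49, 0.01, 0.50],
]

copeland = copeland_winners(PREF)
borda = borda_winners(PREF)
```

In this toy matrix arm 0 is the Condorcet (and hence Copeland) winner, yet arm 1 has the higher Borda score, illustrating how the two notions can be incompatible.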

Table 1: Percentage of matrices for which the different notions of winners overlapped 

Overlap        Copeland    von Neumann    Borda     Random Walk 
Copeland       100%        99.94%         51.49%    56.15% 
von Neumann    99.94%      100%           77.66%    82.11% 
Borda          51.49%      77.66%         100%      94.81% 
Random Walk    56.15%      82.11%         94.81%    100% 


As these numbers demonstrate, the Copeland and the von Neumann winners are very likely to overlap, as are the 
Borda and Random Walk winners, while the first two definitions are more likely to be incompatible with the latter 
two. Furthermore, in the case of 94.2% of the preference matrices, all Copeland winners were contained in the support 
of the von Neumann winner, suggesting that in practice the Copeland winner is a more restrictive notion of what 
constitutes a winner. 



Figure 4: The probability that the Condorcet assumption holds for subsets of the feature rankers. The probability is 
shown as a function of the size of the subset. 

C.3 The Quantities $C$ and $L_C$ 


We also examine additional quantities relevant to our regret bounds: the number of Copeland winners, $C$; the number 
of losses of each Copeland winner, $L_C$; and the range of values in which these quantities fall. Using the above 
randomly chosen preference sub-matrices, we counted the number of times each possible value for $C$ and $L_C$ was 
observed. The results are depicted in Figure 5: the area of the circle with coordinates $(x, y)$ is proportional to the 
percentage of examples with $K = x$ which satisfied $C = y$ (in the top plot) or $L_C = y$ (in the bottom plot). As these 
plots show, the parameters $C$ and $L_C$ are generally much lower than $K$. 

C.4 The Gap $\Delta$ 

The regret bound for CCB, given in the main body of the paper, depends on the gap $\Delta$ defined in Definition 3.6, rather than the smallest 
gap $\Delta_{\min}$ as specified in Definition 3.2. The latter would result in a looser regret bound and Figure 6 quantifies this 
deterioration in the ranker evaluation example under consideration here. In particular, the plot depicts the average of 
the ratio between the two bounds (the one using $\Delta$ and the one using $\Delta_{\min}$) across the 10,000 sampled preference 
matrices used in the analysis of the Condorcet winner for each $K$ in the set $\{2, \ldots, 135\}$. The average ratio decreases 
as the number of arms approaches 136 because, as $K$ increases, the sampled preference matrices increasingly resemble 
the full preference matrix and so their gaps $\Delta$ and $\Delta_{\min}$ approach those of the full 136-armed preference matrix as 
well. As it turns out, the ratio $\Delta^2/\Delta_{\min}^2$ for the full matrix is equal to 1,419. Hence, the curve in Figure 6 approaches 
that number as the number of arms approaches 136. 


Figure 5: Observed values of the parameters $C$ and $L_C$: the area of the circle with coordinates $(x, y)$ is proportional 
to the percentage of examples with $K = x$ which satisfied $C = y$ (in the top plot, titled "Number of Copeland winners") 
or $L_C = y$ (in the bottom plot, titled "Number of losses of Copeland winners"). The horizontal axis of both plots is the number of arms. 

Figure 6: The average advantage gained by having the regret bound depend on $\Delta$ rather than $\Delta_{\min}$: for each number 
of arms $K$, the expectation is taken across the 10,000 K-armed preference matrices obtained using the sampling 
procedure described above. 

D Background Material 

Maximal Azuma-Hoeffding Bound [40, §A.1.3]: Given random variables $X_1, \ldots, X_N$ with common range $[0,1]$ 
satisfying $E[X_n \mid X_1, \ldots, X_{n-1}] = \mu$, define the partial sums $S_n = X_1 + \cdots + X_n$. Then, for all $a > 0$, we have 

$$P\left(\max_{n \le N} S_n > n\mu + a\right) \le e^{-2a^2/N},$$

$$P\left(\min_{n \le N} S_n < n\mu - a\right) \le e^{-2a^2/N}.$$

Here, we will quote a useful lemma that we will refer to repeatedly in our proofs: 

Lemma 17 (Lemma 1 in [13]). Let $P := [p_{ij}]$ be the preference matrix of a K-armed dueling bandit problem with 
arms $\{a_1, \ldots, a_K\}$. Then, for any dueling bandit algorithm and any $\alpha > \frac{1}{2}$ and $\delta > 0$, we have 

$$P\Big(\forall\, t > C(\delta),\, i,\, j:\; p_{ij} \in [l_{ij}(t), u_{ij}(t)]\Big) > 1 - \delta.$$
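
The confidence intervals $[l_{ij}(t), u_{ij}(t)]$ referred to in this lemma can be sketched as follows, assuming the usual UCB-style form in which the empirical win rate is widened by $\sqrt{\alpha \ln t / N_{ij}(t)}$ on each side. The function name, the default value of $\alpha$ and the convention for pairs that have never been compared are assumptions made for this illustration.

```python
import math

def confidence_interval(wins_ij, wins_ji, t, alpha=0.51):
    """Return (l_ij, u_ij) for arm i against arm j at time-step t.

    wins_ij / wins_ji count how often i beat j and j beat i so far.
    Assumed convention: with no comparisons yet, the interval is [0, 1].
    """
    n = wins_ij + wins_ji
    if n == 0:
        return 0.0, 1.0
    mean = wins_ij / n
    radius = math.sqrt(alpha * math.log(t) / n)
    return mean - radius, mean + radius

lower, upper = confidence_interval(60, 40, 1000)
```

The width of the interval is twice the radius, which is the quantity that the proofs below compare against the gaps.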

E Proof of Proposition 

Before starting with the proof, let us point out the following two properties that can be derived from Assumption A in 
Section 5: 

P1 There are no ties involving a Copeland winner and a non-Copeland winner, i.e., for all pairs of arms $(a_i, a_j)$ 
with $i \le C < j$, we have $p_{ij} \ne 0.5$. 

P2 Each non-Copeland winner has more losses than every Copeland winner, i.e., for every pair of arms $(a_i, a_j)$ 
with $i \le C < j$, we have $|\mathcal{L}_i| < |\mathcal{L}_j|$. 

Even though we have assumed in the statement of the proposition that Assumption A holds, it turns out that the 
proof provided in this section holds as long as the above two properties hold. 

Proposition. Applying CCB to a dueling bandit problem satisfying properties P1 and P2, we have the following 
bounds on the number of comparisons involving various arms for each $T > C(\delta)$: for each pair of arms $a_i$ and $a_j$, 
such that either at least one of them is not a Copeland winner or $p_{ij} \ne 0.5$, with probability $1 - \delta$ we have 

$$N^{\delta}_{ij}(T) \le \widehat{N}^{\delta}_{ij}(T) := \begin{cases} \dfrac{4\alpha \ln T}{(\Delta^*_{ij})^2} & \text{if } i \ne j, \\ 0 & \text{if } i = j > C. \end{cases} \qquad (2)$$
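
As a rough numerical illustration of a bound of the form $4\alpha \ln T / \Delta^2$, the sketch below evaluates it and checks that, after that many comparisons, a confidence interval of width $2\sqrt{\alpha \ln T / N}$ has shrunk to exactly the gap. The function names and the values of $\alpha$, $T$ and the gap are assumptions chosen for the example.

```python
import math

def comparison_bound(alpha, horizon, gap):
    """Bound of the form 4 * alpha * ln(T) / gap^2 on the number of comparisons."""
    return 4.0 * alpha * math.log(horizon) / gap ** 2

def interval_width(alpha, t, n):
    """Width 2 * sqrt(alpha * ln(t) / n) of a confidence interval after n comparisons."""
    return 2.0 * math.sqrt(alpha * math.log(t) / n)

bound = comparison_bound(alpha=0.51, horizon=10**6, gap=0.1)
width_at_bound = interval_width(alpha=0.51, t=10**6, n=bound)
```

Halving the gap quadruples the bound, which is why the dependence on the gaps matters so much in the regret analysis.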


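Proof of Proposition. We will prove these bounds by considering a number of cases separately: 

1. $i \le C$ and $p_{ij} \ne 0.5$: First of all, since $a_i$ is a Copeland winner, this means that according to the definitions in 
the corresponding tables, $\Delta^*_{ij}$ is simply equal to $\Delta_{ij}$; secondly, assuming by way of contradiction that $N^{\delta}_{ij}(T) > 0$, 
we have $\tau_{ij} > C(\delta)$ and so by Lemma 17 we have with probability $1 - \delta$ that the confidence interval 
$[l_{ij}(\tau_{ij}), u_{ij}(\tau_{ij})]$ contains the preference probability $p_{ij}$. But, in order for arm $a_j$ to have been chosen as the 
challenger to $a_i$, we must also have $0.5 \in [l_{ij}(\tau_{ij}), u_{ij}(\tau_{ij})]$; to see this, let us consider the two possible cases: 

(a) If we have $p_{ij} > 0.5$, then having 

$$0.5 \notin [l_{ij}(\tau_{ij}), u_{ij}(\tau_{ij})]$$

implies that we have $l_{ij}(\tau_{ij}) > 0.5$, which in turn implies 

$$u_{ji}(\tau_{ij}) = 1 - l_{ij}(\tau_{ij}) < 0.5 = u_{ii}(\tau_{ij}),$$

but this is impossible since in that case $a_i$ would’ve been chosen as the challenger. 

(b) If we have $p_{ij} < 0.5$, then having 

$$0.5 \notin [l_{ij}(\tau_{ij}), u_{ij}(\tau_{ij})]$$

implies that we have $u_{ij}(\tau_{ij}) < 0.5$, but this is impossible because it means that we had $l_{ji}(\tau_{ij}) > 0.5$, and 
CCB would’ve eliminated it from consideration in its second round. 

So, in either case, we cannot have $0.5 \notin [l_{ij}(\tau_{ij}), u_{ij}(\tau_{ij})]$. Therefore, at time $\tau_{ij}$, we must have had $u_{ij}(\tau_{ij}) - l_{ij}(\tau_{ij}) > |p_{ij} - 0.5| =: \Delta_{ij}$. From this, we can conclude the following, using the definition of $u_{ij}$ and $l_{ij}$: 

$$\begin{aligned}
u_{ij}(\tau_{ij}) - l_{ij}(\tau_{ij}) := 2\sqrt{\frac{\alpha \ln \tau_{ij}}{N_{ij}(\tau_{ij})}} &> \Delta_{ij} \\
\Longrightarrow \quad 2\sqrt{\frac{\alpha \ln T}{N_{ij}(\tau_{ij})}} &> \Delta_{ij} \qquad \text{(since } \tau_{ij} \le T\text{)} \\
\Longrightarrow \quad N_{ij}(\tau_{ij}) &< \frac{4\alpha \ln T}{\Delta_{ij}^2} \\
\Longrightarrow \quad N^{\delta}_{ij}(\tau_{ij}) &< \frac{4\alpha \ln T}{\Delta_{ij}^2} \qquad \text{(since } N^{\delta}_{ij}(\tau_{ij}) \le N_{ij}(\tau_{ij})\text{)}
\end{aligned}$$

giving us the desired bound. The reader is referred to Figure 7 for an illustration of this argument. 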

2. $C < i$: Let us deal with the two cases included in Inequality (2) separately: 

(a) $i = j > C$: In plain terms, this says that with probability $1 - \delta$ no non-Copeland winner will be compared 
against itself after time $C(\delta)$. The reason for this is the following set of facts: 

Figure 7: This figure illustrates the definition of the quantities $\Delta^*_i$ and $\Delta^*_{ij}$ in the case that arm $a_i$ is a Copeland winner, 
as well as the idea behind Case 1 in the proof of the proposition. In this setting we have $\Delta^*_i = 0$ and $\Delta^*_{ij} = \Delta_{ij}$. On 
the one hand, by Lemma 17 we know that the confidence intervals will contain the $p_{ij}$ (the blue dots in the plots), and 
on the other, as soon as the confidence interval of $p_{ij}$ stops containing 0.5 for some arm $a_j$, we know that it could not 
be chosen to be compared against $a_i$. In this way, the gaps $\Delta^*_{ij}$ regulate the number of times that each arm can be 
chosen to be played against $a_i$ during time-steps when $a_i$ is chosen as optimistic Copeland winner. 


• Since $a_i$ is a non-Copeland winner, we have by Property P2 that it loses to more arms than any Copeland 
winner. 

• For $a_i$ to have been chosen as an optimistic Copeland winner, it has to have (optimistically) lost to no more 
than $L_C$ arms, which means that there exists an arm $a_k$ such that $p_{ik} < 0.5$, but $u_{ik} \ge 0.5$. 

• By Lemma 17, for all time-steps after $C(\delta)$, we have $l_{ik} \le p_{ik} < 0.5$, and so in the second round we have 
$u_{ki} > 0.5 = u_{ii}$, and so $a_i$ could not be chosen as the challenger to itself. 


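(b) $i \ne j$: In the case that $a_i$ is not a Copeland winner and $a_j$ is different from $a_i$, we distinguish between the 
following two cases, where $\Delta^*_i$ is defined as in the corresponding tables: 

i. $p_{ij} < 0.5 - \Delta^*_i$: In this case, the definition of $\Delta^*_{ij}$ reduces to $\Delta_{ij}$. Now, since when choosing the challenger, 
CCB eliminates from consideration any arm $a_j$ that has $l_{ji} > 0.5$, at the last time-step $\tau_{ij}$ after $C(\delta)$ when 
$a_j$ was chosen as the challenger for $a_i$, we must’ve had $u_{ij}(\tau_{ij}) := 1 - l_{ji}(\tau_{ij}) \ge 0.5$. On the other hand, 
Lemma 17 implies that we must also have $l_{ij}(\tau_{ij}) \le p_{ij}$, and therefore, we have $u_{ij}(\tau_{ij}) - l_{ij}(\tau_{ij}) \ge 0.5 - p_{ij} = \Delta_{ij}$; 
so, doing the same calculation as in part 1 of this proof, we have 

$$\begin{aligned}
u_{ij}(\tau_{ij}) - l_{ij}(\tau_{ij}) := 2\sqrt{\frac{\alpha \ln \tau_{ij}}{N_{ij}(\tau_{ij})}} &\ge \Delta_{ij} \\
\Longrightarrow \quad 2\sqrt{\frac{\alpha \ln T}{N_{ij}(\tau_{ij})}} &\ge \Delta_{ij} \qquad \text{(since } \tau_{ij} \le T\text{)} \\
\Longrightarrow \quad N_{ij}(\tau_{ij}) &\le \frac{4\alpha \ln T}{\Delta_{ij}^2} \\
\Longrightarrow \quad N^{\delta}_{ij}(\tau_{ij}) &\le \frac{4\alpha \ln T}{\Delta_{ij}^2} \qquad \text{(since } N^{\delta}_{ij}(\tau_{ij}) \le N_{ij}(\tau_{ij})\text{)}
\end{aligned}$$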


ii. $p_{ij} \ge 0.5 - \Delta^*_i$: Repeating the above argument about $u_{ij}(\tau_{ij})$, we can deduce that $u_{ij}(\tau_{ij}) \ge 0.5$ must 
hold; moreover, by Lemma 17 we have $u_{ij}(\tau_{ij}) \ge p_{ij}$, and putting the two together we get 

$$u_{ij}(\tau_{ij}) \ge \max\{0.5, p_{ij}\}. \qquad (3)$$
On the other hand, we will show next that with probability $1 - \delta$, we have $l_{ij}(\tau_{ij}) \le 0.5 - \Delta^*_i$; this is a 
consequence of the following facts: 

• Since $a_i$ was chosen as the optimistic Copeland winner, we can deduce that $a_i$ had no more than $L_C$ 
optimistic losses. 

• Let $a_{k_1}, \ldots, a_{k_l}$ be the $l \le L_C$ arms to which $a_i$ lost optimistically during time-step $\tau_{ij}$. Then, the 
smallest $p_{ik}$ with $k \notin \{k_1, \ldots, k_l\}$ must be less than or equal to the $(L_C + 1)$-th smallest element in the 
set $\{p_{ik} \mid k = 1, \ldots, K\}$. 

• This, in turn, is equal to the $(L_C + 1)$-th smallest element in the set $\{p_{ik} \mid p_{ik} < 0.5\}$ (since this latter set 
of numbers are the smallest ones in the former set). But, this is equal to $0.5 - \Delta^*_i$ by definition. 

So, we have the desired bound on $l_{ij}(\tau_{ij})$ and combining this with Inequality (3), we have 

$$u_{ij}(\tau_{ij}) - l_{ij}(\tau_{ij}) \ge \max\{0, p_{ij} - 0.5\} + \Delta^*_i = \Delta^*_{ij},$$

where the last equality follows directly from the definition of $\Delta^*_{ij}$ and the fact that $p_{ij} \ge 0.5 - \Delta^*_i$. Now, 
repeating the same calculations as before, we can conclude that with probability $1 - \delta$, we have 

$$N^{\delta}_{ij}(T) \le \frac{4\alpha \ln T}{(\Delta^*_{ij})^2}.$$

A pictorial depiction of the various steps in this part of the proof can be found in Figure 8. 


□ 

Figure 8: This figure illustrates the definition of the quantities $\Delta^*_i$ and $\Delta^*_{ij}$ in the case that arm $a_i$ is not a Copeland 
winner, as well as the idea behind Case 2 in the proof of the proposition. The bottom row of plots in the figure 
corresponds to the confidence intervals around probabilities $p_{ij}$ (depicted using the blue dots) for $j = 1, \ldots, K$, 
while the top row corresponds to those for probabilities $p_{1j}$, where $a_1$ is by assumption one of the Copeland winners 
(although we could use any other Copeland winner instead). 

The two boxes in the top row with red intervals represent arms to which $a_1$ loses (i.e. $p_{1j} < 0.5$), the number of which 
happens to be 2 in this example, which means that $L_C = 2$. Now, by definition, $i^*$ is the index $j$ 
with the $(L_C + 1)$-th (in this case 3rd) lowest $p_{ij}$, and since the three lowest $p_{ij}$ in this example are $p_{iK}$, $p_{iC}$ and $p_{ii^*}$, 
this means that the column labeled as $a_{i^*}$ is indeed labeled correctly. Given this, the definition of $\Delta^*_i$ tells us that it is the 
size of the gap shown in the block corresponding to pair $(a_i, a_{i^*})$. 

Moreover, by Definition 3.5, the gap $\Delta^*_{ij}$ is defined using one of the following three cases: (1) if we have $p_{ij} < p_{ii^*}$ 
(as with the ones with red confidence intervals in the bottom row of plots), then we get $\Delta^*_{ij} := \Delta_{ij} = 0.5 - p_{ij}$; (2) if 
we have $p_{ii^*} \le p_{ij} \le 0.5$ (as in the plots in the 2nd, 3rd and 7th columns of the bottom row), then we get $\Delta^*_{ij} := \Delta^*_i$; 
(3) if we have $0.5 < p_{ij}$ (as in the 1st and 6th columns in the bottom row), then we get $\Delta^*_{ij} := \Delta_{ij} + \Delta^*_i$. 

The reasoning behind this trichotomy is as follows: in the case of arms $a_j$ in group (1), they are not going to be chosen 
to be played against $a_i$ as soon as the top of the interval goes below 0.5, and by Lemma 17 we know that the bottom of 
the interval will be below $p_{ij}$. In the case of the arms in groups (2) and (3), the bottom of their interval needs to be 
below $p_{ii^*}$ because otherwise that would mean that neither arm $a_{i^*}$ nor arms in group (1) were eligible to be included 
in the argmax expression in Line 13 of Algorithm 1, which can only happen if we have $u_{ij} < 0.5$ for $j = i^*$ as well 
as the arms in group (1), from which we can deduce that the optimistic Copeland score of $a_i$ must have been lower 
than $K - 1 - L_C$, and so $a_i$ could not have been chosen as an optimistic Copeland winner. Using the same argument, 
we can also see that the tops of the confidence intervals corresponding to arms in group (2) must be above 0.5, or else 
it would be impossible for $a_i$ to be chosen as an optimistic Copeland winner. Moreover, by Lemma 17 the intervals 
of the arms $a_j$ in group (3) must contain $p_{ij}$. 
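
The three-way case split for the gaps of a non-Copeland winner can be sketched in code. This is only an illustration under stated assumptions: the function name `star_gaps`, the example row of probabilities and the value of $L_C$ are hypothetical, and the arm is assumed to have more than $L_C$ losses so that the selected probability lies below 0.5.

```python
def star_gaps(row, i, l_c):
    """Gaps for a non-Copeland arm i, following the trichotomy in the text.

    row[j] holds p_ij for the fixed arm i; l_c is the number of losses of a
    Copeland winner. Returns (i_star, delta_star_i, {j: delta_star_ij}).
    """
    others = sorted((p, j) for j, p in enumerate(row) if j != i)
    p_star, i_star = others[l_c]        # the (L_C + 1)-th lowest p_ij
    delta_star_i = 0.5 - p_star
    gaps = {}
    for p, j in others:
        delta_ij = abs(p - 0.5)
        if p < p_star:                  # group (1): clear losses of arm i
            gaps[j] = delta_ij
        elif p <= 0.5:                  # group (2): between p_{i i*} and 0.5
            gaps[j] = delta_star_i
        else:                           # group (3): arms that arm i beats
            gaps[j] = delta_ij + delta_star_i
    return i_star, delta_star_i, gaps

# Hypothetical row for arm i = 1 in an 8-armed problem with L_C = 2.
ROW = [0.7, 0.5, 0.1, 0.2, 0.35, 0.45, 0.6, 0.05]
i_star, delta_star_i, gaps = star_gaps(ROW, i=1, l_c=2)
```

In this example the third lowest probability is 0.2, so the starred gap of the arm is 0.3, and every arm that it beats receives a gap enlarged by that amount.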


















































































F Proof of Lemma 0 

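Let us begin with the following direct corollary of the proposition proved in Appendix E: 

Corollary 18. Given any $\delta > 0$, any $T > C(\delta)$ and any sub-interval of length $\widehat{N}^{\delta}(T) := \sum_{i \ne j} \widehat{N}^{\delta}_{ij}(T) + 1$, with 
probability $1 - \delta$, there is at least one time-step when there exists $c \le C$ such that 

$$\underline{\mathrm{Cpld}}(a_c) = \mathrm{Cpld}(a_c) = \overline{\mathrm{Cpld}}(a_c) \ge \overline{\mathrm{Cpld}}(a_j) \quad \forall j. \qquad (4)$$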

Proof. According to the proposition, with probability $1 - \delta$ there are at most $\sum_{i \ne j} \widehat{N}^{\delta}_{ij}(T)$ time-steps between $C(\delta)$ 
and $T$ when Algorithm 1 did not compare a Copeland winner against itself, i.e. $c$ and $d$ in Algorithm 1 did not satisfy 
$c = d \le C$. 

In other words, during this time-period, in any sub-interval of length $\widehat{N}^{\delta}(T) := \sum_{i \ne j} \widehat{N}^{\delta}_{ij}(T) + 1$, there is at least 
one time-step when a Copeland winner was compared against itself. During this time-step, we must have had 

$$\underline{\mathrm{Cpld}}(a_c) = \mathrm{Cpld}(a_c) = \overline{\mathrm{Cpld}}(a_c) \ge \overline{\mathrm{Cpld}}(a_j) \quad \forall j,$$

where the first two equalities are due to the fact that in order for Algorithm 1 to set $c = d$, we must have $0.5 \notin [l_{cj}, u_{cj}]$ 
for each $j \ne c$, or else $a_c$ would not be played against itself; on the other hand, the last inequality is due to the fact 
that $a_c$ was chosen as an optimistic Copeland winner by Line 8 of Algorithm 1, so its optimistic Copeland score must 
have been greater than or equal to the optimistic Copeland score of the rest of the arms. □ 

Lemma 19. If there exists an arm $a_i$ with $i > C$ such that $B^i_t$ contains an arm $a_j$ that loses to $a_i$ (i.e. $p_{ij} > 0.5$) 
or such that $B^i_t$ contains fewer than $L_C + 1$ arms, then the probability that by time-step $T_0$ the sets $B^i_t$ and $B_t$ are 
not reset by Line 9.A of Algorithm 1 is less than $\delta/6$, where we define 

$$T_0 := C(\delta/2) + \widehat{N}^{\delta/2}(T_\delta) + \frac{32\alpha K(L_C + 1)\ln T_\delta}{\Delta_{\min}^2} + 8K^2(L_C + 1)^2 \ln\frac{6K^2}{\delta}.$$

Proof. By Line 9.A of Algorithm 1, as soon as we have $l_{ij} > 0.5$, the set $B^i_t$ will be emptied. In what follows, we will 
show that the probability that the number of time-steps before we have $l_{ij} > 0.5$ is greater than 

$$\Delta T := \widehat{N}^{\delta/2}(T_\delta) + N \quad \text{with} \quad N := \frac{32\alpha K(L_C + 1)\ln T_\delta}{\Delta_{\min}^2} + 8K^2(L_C + 1)^2 \ln\frac{6K^2}{\delta}$$

is bounded by $\delta/6K^2$. This is done using the amount of exploration infused by Line 10 of Algorithm 1. To begin, let 
us note that by Corollary 18 there is a time-step before $T_0 := C(\delta/2) + \widehat{N}^{\delta/2}(T_\delta)$ when the condition of Line 9.C of 
Algorithm 1 is satisfied for some Copeland winner. At this point, if $B^i_t$ contains fewer than $L_C + 1$ elements, then it 
will be emptied; furthermore, for all $k > C$, the sets $B^k_t$ will have at most $L_C + 1$ elements and so the set 

$$S_t := \{(k, \ell) \mid a_\ell \in B^k_t \text{ and } 0.5 \in [l_{k\ell}, u_{k\ell}]\}$$

contains at most $K(L_C + 1)$ elements for all $t \ge T_0$. Moreover, if at time-step $T_1 := C(\delta/2) + \Delta T$ we have $a_j \in B^i_{T_1}$, 
then we can conclude that $(i, j) \in S_t$ for all $t \in [C(\delta/2), T_1]$, since, if at any time after $C(\delta/2)$ arm $a_j$ were to be 
removed from $B^i_t$, it will never be added back because that can only happen through Line 9.B of Algorithm 1 and by 
Lemma 17 and the assumption of the lemma we have $u_{ij} \ge p_{ij} > 0.5$. 


What we can conclude from the observations in the last paragraph is that if at time-step $T_1$ we still have $a_j \in B^i_{T_1}$, 
then there are $\Delta T$ time-steps during which the probability of comparing arms $a_i$ and $a_j$ was at least $\frac{1}{4K(L_C + 1)}$ and 
yet no more than $\frac{4\alpha \ln T_\delta}{\Delta_{ij}^2}$ comparisons took place, since otherwise, we would have $l_{ij} > 0.5$ at some point before 
$T_1$. Now, let $B^{ij}_n$ denote the indicator random variable that is equal to 1 if arms $a_i$ and $a_j$ were chosen to be played 
against each other by Line 10 of Algorithm 1 during time-step $T_1 + n$. Also, let $A_1, \ldots, A_N$ be iid Bernoulli random 
variables with mean $\frac{1}{4K(L_C + 1)}$. Since $B^{ij}_n$ and $A_n$ are Bernoulli and we have $E[B^{ij}_n] \ge E[A_n]$ for each $n$, we 
can conclude that 

$$P\left(\sum_{n=1}^{N} B^{ij}_n < s\right) \le P\left(\sum_{n=1}^{N} A_n < s\right) \quad \text{for all } s.$$

On the other hand, we can use the Hoeffding bound to show that the right-hand side of the above inequality is 
smaller than $\delta/6K^2$ if we set $s = \frac{4\alpha \ln T_\delta}{\Delta_{\min}^2}$: 


N 




\n—l 



4K{Lc + 1) 

4a In Ts 


A2. 

min 



4K{Lc + 1) 


32a^ In^ Tg 


4^ In Tg 


N 


. N iC(Lr' + l)A^ . 8iC4(L^ + l)^ 

min V I / min ^ ^ ' 

4c. In Ts 


^ ^K{Lc + l)^^ 8K^(Lc + iy^ 


Now, if we take a union bound over all pairs of arms $a_i$ and $a_j$ satisfying the condition stated at the beginning of 
this scenario, we get that with probability $1 - \delta/6$ by time-step $C(\delta/2) + \Delta T$ all such erroneous hypotheses are reset by 
Line 9.A of Algorithm 1, emptying the sets $B^i_t$. □ 

Lemma 20. Let $t_1 \in [C(\delta/2), T_\delta)$ be such that for all $i, j$ satisfying $a_j \in B^i_{t_1}$ we have $p_{ij} < 0.5$. Then, the following 
two statements hold with probability $1 - 5\delta/6$: 

1. If the set $B_{t_1}$ in Algorithm 1 contains at least one Copeland winner, then if we set $t_2 = t_1 + n_{\max}$, where 


^max 


2K max Nf^^iTs) + 

i>C * 


ln(6A/5) 

2 


then $B_{t_2}$ is non-empty and contains no non-Copeland winners, i.e. for all $a_i \in B_{t_2}$ we have $i \le C$. 

2. If the set $B_{t_1}$ in Algorithm 1 contains no Copeland winners, i.e. for all $a_i \in B_{t_1}$ we have $i > C$, then within $n_{\max}$ 
time-steps the set $B_t$ will be emptied by Line 9.B of Algorithm 1. 

Therefore, with probability $1 - 5\delta/6$, by time $t_1 + 2n_{\max}$ non-Copeland winners (i.e. arms $a_i$ with $i > C$) are 
eliminated from $B_t$. 

Proof. We will consider the two cases in the following, conditioning on the conclusions of Lemma 17, the proposition of Appendix E 
and Corollary 18 all simultaneously holding with probability $1 - \delta/2$: 

1. $B_{t_1}$ contains a Copeland winner (i.e. $a_c \in B_{t_1}$ for some $c \le C$): in this case, by Lemma 17 we know that the 
Copeland winner will forever remain in the set $B_t$ because 

$$\overline{\mathrm{Cpld}}(a_c) \ge \max_j \mathrm{Cpld}(a_j) \ge \max_j \underline{\mathrm{Cpld}}(a_j),$$

and so $B_{t_2}$ will indeed be non-empty. Moreover, in what follows, we will show that the probability that any non-Copeland 
winner in $B_t$ is not eliminated by time $t_2$ is less than $\delta/6$. Let us assume by way of contradiction that there exists 
an arm $a_b$ with $b > C$ such that $a_b$ is in $B_{t_2}$: we will show that the probability of this happening is less than $\delta/6K$, 
and so, taking a union bound over non-Copeland winning arms, the probability that any non-Copeland winner is in 
$B_{t_2}$ is seen to be smaller than $\delta/6$. 

Now, to see that the probability of $a_b$ being in the set $B_{t_2}$ is small, note that $a_b$ being in $B_{t_2}$ implies 
that $a_b$ was in the set $B_t$ for the entirety of the time interval $[C(\delta/2), t_2]$, as we will show in the following. If $a_b$ is 
eliminated from $B_t$ at some point between $t_1$ and $t_2$, it will not get added back into $B_t$ because that can only take 
place if the set $B_t$ is reset at some point and there are only two ways for that to happen: 

(a) By Line 9.A of Algorithm 1 in the case that for some pair $(i, j)$ with $a_j \in B^i_t$ we have $l_{ij} > 0.5$; however, this 
is ruled out by our assumption that at time $t_1$ we have $p_{ij} < 0.5$ and by Lemma 17, which stipulates that we 
have $l_{ij} \le p_{ij} < 0.5$. 

(b) By Line 9.B of Algorithm 1 in the case that all arms are eliminated from $B_t$, but this cannot happen by the fact 
mentioned above that $a_c$ will not be removed from $B_t$. 


So, as mentioned above, we indeed have that at each time-step between $t_1$ and $t_2$, the set $B_t$ contains $a_b$. Next, we will show that the probability of this happening is less than $\delta/(6K)$. To do so, let us denote by $S_b$ the time-steps when arm $a_b$ was in the set of optimistic Copeland winners, i.e.


$$S_b := \{\, t \in (t_1, t_2] \mid a_b \in C_t \,\}.$$


We can use Corollary 18 above with $T = T_\delta$ to show that the size of the set $S_b$ (which we denote by $|S_b|$) is bounded from below by $t_2 - t_1 - \sum_{i \ne j} \widehat{N}^{\delta/2}_{ij}(T_\delta)$: this is because whenever any Copeland winner $a_c$ is played against itself, its optimistic and pessimistic Copeland scores coincide, and so if we were to have $a_b \notin C_t$ during that time-step, $a_b$ would have had to be eliminated from $B_t$, because $a_b$ not being an optimistic Copeland winner would imply that


$$\overline{\mathrm{Cpld}}(a_b) < \overline{\mathrm{Cpld}}(a_c) = \underline{\mathrm{Cpld}}(a_c).$$


But we know from facts (a) and (b) above that $a_b$ remains in $B_t$ for all $t \in (t_1, t_2]$. Therefore, as claimed, we have


$$|S_b| \ge t_2 - t_1 - \sum_{i \ne j} \widehat{N}^{\delta/2}_{ij}(T_\delta) \ge 2K \widehat{N}^{\delta/2}_b(T_\delta) + \frac{K^2}{2} \ln\frac{6K}{\delta} =: n_b, \qquad (5)$$

where the last inequality is due to the definition of $n_{\max} := t_2 - t_1$. On the other hand, the Proposition tells us that the number of time-steps between $t_1$ and $t_2$ when $a_b$ could have been chosen as an optimistic Copeland winner is bounded as

$$N^{\delta/2}_b(T_\delta) \le \widehat{N}^{\delta/2}_b(T_\delta). \qquad (6)$$

Furthermore, given the fact that during each time-step $t \in S_b$ we have $a_b \in B_t \cap C_t$, the probability of $a_b$ being chosen as an optimistic Copeland winner is at least $1/K$ because of the sampling procedure in Lines 14-17 of Algorithm 1. However, this is considerably higher than the ratio obtained by dividing the right-hand side of Inequality (6) by that of Inequality (5). We will make this more precise in the following: for each $t \in S_b$, denote by $p^t_b$ the probability that arm $a_b$ would be chosen as the optimistic Copeland winner by Algorithm 1, and let $X^t_b$ be the Bernoulli random variable that returns 1 when arm $a_b$ is chosen as the optimistic Copeland winner and 0 otherwise. As pointed out above, we have that $p^t_b \ge 1/K$ for all $t \in S_b$, which, together with the fact that $|S_b| \ge n_b$, implies that the random variable $X_b := \sum_{t \in S_b} X^t_b$ satisfies


$$P(X_b < x) \le P(\mathrm{Binom}(n_b, 1/K) < x). \qquad (7)$$

This is both because the Bernoulli summands of $X_b$ have higher means than the Bernoulli summands of $\mathrm{Binom}(n_b, 1/K)$ and because $X_b$ is the sum of a larger number of Bernoulli variables, so $X_b$ has more mass away from 0 than does $\mathrm{Binom}(n_b, 1/K)$. So, if we can bound the right-hand side of Inequality (7) by $\delta/(6K)$ with $x = \widehat{N}^{\delta/2}_b(T_\delta)$, we get our desired result. But this is a simple consequence of the Hoeffding bound, a more general form of which is quoted in Section D. More precisely, writing $\widehat{N}_b$ for $\widehat{N}^{\delta/2}_b(T_\delta)$, we have

$$P\big(\mathrm{Binom}(n_b, 1/K) < \widehat{N}_b\big) = P\big(\mathrm{Binom}(n_b, 1/K) < n_b/K - a\big) \quad \text{with } a := n_b/K - \widehat{N}_b$$

$$\le e^{-2a^2/n_b} = e^{-2n_b/K^2 + 4\widehat{N}_b/K - 2\widehat{N}_b^2/n_b}$$

$$< e^{-2n_b/K^2 + 4\widehat{N}_b/K} \le e^{-\ln(6K/\delta)} = \frac{\delta}{6K}.$$


Using the union bound over the non-Copeland winning arms that were in $B_{t_1}$, of which there are at most $K - 1$, we can conclude that with probability at least $1 - \delta/6$ they are all eliminated by time $t_2$.

2. $B_{t_1}$ does not contain any Copeland winners: in this case, we can use the exact same argument as above to conclude that the probability that the set $B_t$ is non-empty for all $t \in (t_1, t_2]$ is less than $\delta/6$, because as before the probability that each arm $a_b \in B_{t_1}$ is not eliminated within $n_{\max}$ time-steps is smaller than $\delta/(6K)$. □
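The Hoeffding step in case 1 can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the values of `K`, `delta` and `n_hat` (standing in for the bound on the number of times $a_b$ can be chosen) are hypothetical, and `n_b` is taken to be the smallest integer for which $2n_b/K^2 - 4\widehat{N}_b/K \ge \ln(6K/\delta)$, which is what the last step of the chain requires. It compares the exact binomial tail, the Hoeffding bound and the target $\delta/(6K)$.

```python
import math

def binom_below(n: int, p: float, x: int) -> float:
    """Exact P(Binom(n, p) < x) for an integer threshold x."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x))

def hoeffding_lower_tail(n: int, p: float, x: float) -> float:
    """Hoeffding bound exp(-2 a^2 / n) on P(Binom(n, p) < n*p - a), with a = n*p - x."""
    a = n * p - x
    return math.exp(-2.0 * a * a / n) if a > 0 else 1.0

# Hypothetical values, chosen only for illustration.
K, delta, n_hat = 5, 0.1, 10
# Smallest integer n_b with 2*n_b/K^2 - 4*n_hat/K >= ln(6K/delta).
n_b = math.ceil(2 * K * n_hat + (K**2 / 2) * math.log(6 * K / delta))

exact = binom_below(n_b, 1 / K, n_hat)            # true tail probability
bound = hoeffding_lower_tail(n_b, 1 / K, n_hat)   # Hoeffding upper bound on it
target = delta / (6 * K)                          # what the proof needs
print(n_b, exact, bound, target)
```

In this instance the exact tail is far below the Hoeffding bound, which in turn sits below $\delta/(6K)$; the slack reflects the term $2\widehat{N}_b^2/n_b$ that the proof discards.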


Let us now state the following consequence of the previous lemmas: 

Lemma 7. Given $\delta > 0$, the following fact holds with probability $1 - \delta$: for each $i > C$, the set $B^i_{T_\delta}$ contains exactly $L_C + 1$ elements, with each element $a_j$ satisfying $p_{ij} < 0.5$. Moreover, for all $t \in [T_\delta, T]$, we have $B^i_t = B^i_{T_\delta}$.


Proof. In the remainder of the proof, we condition on the high probability event that the conclusions of Lemma 17, Corollary 18, Lemma 19 and Lemma 20 all hold simultaneously, which happens with probability $1 - \delta$.

By Lemma 20, we can conclude that by time-step $T_1 := T_0 + 2n_{\max}$ all non-Copeland winners are removed from $B_{T_1}$, which also means by Line 9.B of Algorithm 1 that the corresponding sets $B^i_{T_1}$ with $i > C$ are non-empty, and Lemma 19 tells us that these sets have at least $L_C + 1$ elements $a_j$, each of which beats $a_i$ (i.e. $p_{ij} < 0.5$).

Now, applying Corollary 18, we know that within $\widehat{N}^{\delta/2}(T_\delta)$ time-steps, Line 9.C of Algorithm 1 will be executed, at which point we will have $\overline{L}_C = L_C$ and so $B^i_t$ will be reduced to $L_C + 1$ elements. Moreover, by Lemma 17, for all $t > T_1$ and $a_j \in B^i_t$ we have $l_{ij} \le p_{ij} < 0.5$, and so $B^i_t$ will not be emptied by any of the provisions in Line 9 of Algorithm 1.

Now, since by definition we have $T_\delta \ge T_1 + \widehat{N}^{\delta/2}(T_\delta)$, we have the desired result. □

G Proof of Lemma 8


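Lemma 8. Given a Copeland bandit problem satisfying Assumption A and any $\delta > 0$, with probability $1 - \delta$ the following statement holds: the number of time-steps between $T_{\delta/2}$ and $T$ when each non-Copeland winning arm $a_i$ can be chosen as optimistic Copeland winner (i.e. time-steps when arm $a_c$ in Algorithm 1 satisfies $c = i > C$) is bounded by

$$\widehat{N}^i := 2\widehat{N}^i_B + 2\sqrt{\widehat{N}^i_B}\,\ln\frac{2K}{\delta},$$

where

$$\widehat{N}^i_B := \sum_{j \in B^i_{T_{\delta/2}}} \widehat{N}^{\delta/4}_{ij}(T).$$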

Proof. The idea of the argument is outlined in the following sequence of facts:

1. By Lemma 7, we know that with probability $1 - \delta/2$, for each $i > C$ and all times $t \ge T_{\delta/2}$ the sets $B^i_t$ will consist of exactly $L_C + 1$ arms that beat the arm $a_i$, and that $B^i_t = B^i_{T_{\delta/2}}$.

2. Moreover, if at time $t \ge T_{\delta/2} \ge C(\delta/4)$ Algorithm 1 chooses a non-Copeland winner as an optimistic Copeland winner (i.e. $i > C$), then with probability $1 - \delta/4$ we know that

$$\overline{\mathrm{Cpld}}(a_i) \ge \overline{\mathrm{Cpld}}(a_1) \ge \mathrm{Cpld}(a_1) = K - 1 - L_C.$$

3. This means that there could be at most $L_C$ arms $a_j$ that optimistically lose to $a_i$ (i.e. $u_{ij} < 0.5$), and so at least one arm $a_b \in B^i_t$ does satisfy $u_{ib} \ge 0.5$.

4. This, in turn, means that in Line 13 of Algorithm 1, with probability 0.5 the arm $a_d$ will be chosen from $B^i_t$.

5. By the Proposition, we know that with probability $1 - \delta/4$, in the time interval $[T_{\delta/2}, T]$ each arm $a_j \in B^i_{T_{\delta/2}}$ can be compared against $a_i$ at most $\widehat{N}^{\delta/4}_{ij}(T)$ many times.

Given that by Fact 3 above we need at least one arm $a_j \in B^i_t$ to satisfy $u_{ij} \ge 0.5$ for Algorithm 1 to set $(c, d) = (i, j)$, and that by Fact 4 arms from $B^i_t$ have a higher probability of being chosen to be compared against $a_i$, arm $a_i$ will be chosen as optimistic Copeland winner roughly twice as many times as we had $(c, d) = (i, j)$ for some $j \in B^i_{T_{\delta/2}}$. A high probability version of the claim in the last sentence, together with Fact 5, would give us the bound on regret claimed by the theorem. In the remainder of this proof, we will show that indeed the number of times we have $c = i$ is unlikely to be much higher than twice the number of times we get $(c, d) = (i, j)$, where $j \in B^i_{T_{\delta/2}}$. To do so, we will introduce the following notation:

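$N^i$: the number of time-steps between $T_{\delta/2}$ and $T$ when arm $a_i$ was chosen as optimistic Copeland winner.

$B^i_n$: the indicator random variable that is equal to 1 if Line 13 in Algorithm 1 decided to choose arm $a_d$ only from the set $B^i_{t_n}$ and zero otherwise, where $t_n$ is the $n$-th time-step after $T_{\delta/2}$ when arm $a_i$ was chosen as optimistic Copeland winner. Note that $B^i_n$ is simply a Bernoulli random variable with mean 0.5.

$N^i_B$: the number of time-steps between $T_{\delta/2}$ and $T$ when arm $a_i$ was chosen as optimistic Copeland winner and Line 13 in Algorithm 1 chose to pick an arm from $B^i_{T_{\delta/2}}$ to be played against $a_i$. Note that this definition implies that we have

$$N^i_B = \sum_{n=1}^{N^i} B^i_n. \qquad (8)$$

Moreover, by Fact 5 above, we know that with probability $1 - \delta/4$ we have

$$N^i_B \le \widehat{N}^i_B := \sum_{j \in B^i_{T_{\delta/2}}} \widehat{N}^{\delta/4}_{ij}(T). \qquad (9)$$

Now, we will use the above high probability bound on $N^i_B$ to put the following high probability bound on $N^i$: with probability $1 - \delta/2$ we have

$$N^i \le \widehat{N}^i := 2\widehat{N}^i_B + 2\sqrt{\widehat{N}^i_B}\,\ln\frac{2K}{\delta}.$$

To do so, let us assume that we have $N^i > \widehat{N}^i$ and consider the first $\widehat{N}^i$ time-steps after $T_{\delta/2}$ when arm $a_i$ was chosen as optimistic Copeland winner, and note that by Equation (8) we have

$$\sum_{n=1}^{\widehat{N}^i} B^i_n \le N^i_B,$$

and so by Inequality (9), with probability $1 - \delta/4$ the left-hand side of the last inequality is bounded by $\widehat{N}^i_B$: let us denote this event by $\mathcal{E}$. On the other hand, if we apply the Hoeffding bound (cf. Appendix D) to the variables $B^i_n$, we get

$$P\big(\mathcal{E} \wedge N^i > \widehat{N}^i\big) \le P\Big(\sum_{n=1}^{\widehat{N}^i} B^i_n \le \widehat{N}^i_B\Big) = P\Big(\sum_{n=1}^{\widehat{N}^i} B^i_n \le \frac{\widehat{N}^i}{2} - \sqrt{\widehat{N}^i_B}\,\ln\frac{2K}{\delta}\Big)$$

$$\le \exp\Big(-\frac{2\,\widehat{N}^i_B \ln^2\frac{2K}{\delta}}{\widehat{N}^i}\Big) = \exp\Big(-\frac{\widehat{N}^i_B \ln^2\frac{2K}{\delta}}{\widehat{N}^i_B + \sqrt{\widehat{N}^i_B}\,\ln\frac{2K}{\delta}}\Big). \qquad (10)$$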


To simplify the last expression in the last chain of inequalities, let us use the notation $\alpha := \widehat{N}^i_B$ and $\beta := \ln\frac{2K}{\delta}$. Given this notation, we claim that the following inequality holds if we have $\alpha \ge 4$ and $\beta \ge 2$ (which hold by the assumptions of the theorem):

$$\frac{\alpha\beta^2}{\alpha + \sqrt{\alpha}\,\beta} \ge \beta. \qquad (11)$$

To see this, let us multiply both sides by the denominator of the left-hand side of the above inequality:

$$\alpha\beta^2 \ge \alpha\beta + \sqrt{\alpha}\,\beta^2. \qquad (12)$$

To see why Inequality (12) holds, let us note that the restrictions imposed on $\alpha$ and $\beta$ imply the following pair of inequalities, whose sum is equivalent to Inequality (12):

$$\alpha\beta^2 \ge 2\alpha\beta$$

$$+\quad \alpha\beta^2 \ge 2\sqrt{\alpha}\,\beta^2$$

$$\Longrightarrow\quad 2\alpha\beta^2 \ge 2\alpha\beta + 2\sqrt{\alpha}\,\beta^2.$$
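The elementary inequality used here can be spot-checked on a grid. The sketch below assumes the inequality reads $\alpha\beta^2/(\alpha + \sqrt{\alpha}\,\beta) \ge \beta$ for $\alpha \ge 4$ and $\beta \ge 2$ (our reading of the garbled display); the grid ranges are arbitrary choices made for illustration.

```python
import itertools
import math

def lhs(alpha: float, beta: float) -> float:
    """Left-hand side alpha*beta^2 / (alpha + sqrt(alpha)*beta) of the claimed inequality."""
    return alpha * beta**2 / (alpha + math.sqrt(alpha) * beta)

alphas = [4 + 0.5 * i for i in range(200)]    # alpha >= 4
betas = [2 + 0.25 * j for j in range(200)]    # beta >= 2

# Collect every grid point where the inequality fails (up to rounding error).
violations = [
    (a, b) for a, b in itertools.product(alphas, betas) if lhs(a, b) < b - 1e-12
]
print(len(violations), lhs(4.0, 2.0))
```

The inequality is tight at the corner $\alpha = 4$, $\beta = 2$, which is why both restrictions are needed: rearranging gives $\beta \ge \sqrt{\alpha}/(\sqrt{\alpha} - 1)$, and the right-hand side is at most 2 exactly when $\alpha \ge 4$.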


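Now that we know that Inequality (11) holds, we can combine it with Inequality (10) to get

$$P\big(\mathcal{E} \wedge N^i > \widehat{N}^i\big) \le e^{-\ln\frac{2K}{\delta}} = \frac{\delta}{2K}.$$


Taking a union bound over the non-Copeland winning arms, we get

$$P\big(\mathcal{E} \wedge \exists\, i > C : N^i > \widehat{N}^i\big) \le \delta/2.$$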


So, given the fact that we have $P(\overline{\mathcal{E}}) \le \delta/4$, we know that with probability $1 - \delta$ each non-Copeland winner is selected as optimistic Copeland winner between $T_{\delta/2}$ and $T$ no more than $\widehat{N}^i$ times. □

H A Scalable Solution to the Copeland Bandit Problem


In this section, we prove Lemma 14, providing an analysis of the PAC solver for the Copeland winner identification problem.

To simplify the proof, we begin by solving a slightly easier variant of Lemma 14 where the queries are deterministic. Specifically, rather than having a query to the pair $(a_i, a_j)$ be an outcome of a Bernoulli r.v. with an expected value of $p_{ij}$, we assume that such a query simply yields the answer to whether $p_{ij} > 0.5$. Clearly, a solution can be obtained using $K(K-1)/2$ many queries, but we aim for a solution with query complexity linear in $K$. In this section we prove the following.


Lemma 21. Given $K$ arms and a parameter $\epsilon$, Algorithm 2 finds a $(1 + \epsilon)$-approximate best arm with probability at least $1 - \delta$, by using at most

$$\log(K/\delta) \cdot O\Big(K\log(K) + \min\Big\{\frac{K}{\epsilon^2},\; K^2\big(1 - \mathrm{cpld}(a_1)\big)\Big\}\Big)$$

many queries. In particular, when there is a Condorcet winner ($\mathrm{cpld}(a_1) = 1$) or more generally $\mathrm{cpld}(a_1) = 1 - O(1/K)$, an exact solution can be found with probability at least $1 - \delta$ by using at most

$$O\big(K\log(K)\log(K/\delta)\big)$$

many queries.

The idea behind our algorithm is as follows. We provide an unbiased estimator of the normalized Copeland score of arm $a_i$ by picking an arm $a_j$ uniformly at random and querying the pair $(a_i, a_j)$. This method allows us to apply proof techniques for the classic MAB problem. These techniques provide a bound on the number of queries dependent on the gaps between the different Copeland scores. Our result is obtained by noticing that there cannot be too many arms with a large Copeland score; the formal statement is given later in Lemma 15. If the Copeland winner has a large Copeland score, i.e., $L_C$ is small, then only a small number of arms can be close to optimal. Hence, the main argument of the proof is that the majority of arms can be eliminated quickly and only a handful of arms must be queried many times.
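The estimator described above can be sketched in a few lines for the deterministic-query setting. This is only an illustration: the preference matrix `P` is a made-up four-arm example without a Condorcet winner, and the function names are hypothetical rather than taken from the paper.

```python
import random

def normalized_copeland(P, i):
    """cpld(a_i): fraction of the other arms that a_i beats (p_ij > 0.5)."""
    K = len(P)
    return sum(P[i][j] > 0.5 for j in range(K) if j != i) / (K - 1)

def estimate_cpld(P, i, n_queries, rng):
    """Average of n_queries deterministic queries of a_i against uniformly random opponents."""
    opponents = [j for j in range(len(P)) if j != i]
    wins = sum(P[i][rng.choice(opponents)] > 0.5 for _ in range(n_queries))
    return wins / n_queries

# Hypothetical preference matrix: P[i][j] is the probability that arm i beats arm j.
P = [
    [0.5, 0.6, 0.7, 0.4],
    [0.4, 0.5, 0.8, 0.6],
    [0.3, 0.2, 0.5, 0.9],
    [0.6, 0.4, 0.1, 0.5],
]
rng = random.Random(0)
estimate = estimate_cpld(P, 0, 20_000, rng)
print(normalized_copeland(P, 0), estimate)
```

Each query returns 1 with probability exactly $\mathrm{cpld}(a_i)$, so the arm behaves like a Bernoulli arm in a classical bandit problem, which is what lets the MAB analysis carry over.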

As stated above, our algorithm uses as a black box Algorithm 4, an approximate-best-arm identification algorithm for the classical MAB setup. Recall that here, each arm $a_i$ has an associated reward $\mu_i$ and the objective is to identify an arm with the (approximately) largest reward. Without loss of generality, we assume that $\mu_1$ is the maximal reward. The following lemma provides an analysis of Algorithm 4 that is tight for the case where $\mu_1$ is close to 1. In this case, it is exactly the set of near-optimal arms that will be queried many times; hence it is important to take into consideration that the random variables associated with near-optimal arms have a variance of roughly $1 - \mu_i$, which can be quite small. This translates to savings in the number of queries to arm $a_i$ by a factor of $1 - \mu_i$ compared to an algorithm that does not take the variances into account.

Lemma 22. Algorithm 4 requires as input an error parameter $\epsilon$, a failure probability $\delta$ and an oracle to $K$ Bernoulli distributions. It outputs, with probability at least $1 - \delta$, a $(1 + \epsilon)$-approximate best arm, that is, an arm $a_i$ with corresponding expected reward $\mu_i \ge 1 - (1 - \mu_1)(1 + \epsilon)$, with $\mu_1$ being the maximum expected value among arms. The expected number of queries made by the algorithm is upper bounded by

$$O\Big(\sum_i \frac{(1 - \mu_i)\log\big(K/(\delta\Delta^\epsilon_i)\big)}{(\Delta^\epsilon_i)^2}\Big)$$

with $\Delta^\epsilon_i = \max\{\mu_1 - \mu_i,\ \epsilon(1 - \mu_1)\}$. Moreover, with probability at least $1 - \delta$, the number of times arm $i$ will be queried is at most

$$O\Big(\frac{(1 - \mu_i)\log\big(K/(\delta\Delta^\epsilon_i)\big)}{(\Delta^\epsilon_i)^2}\Big).$$

We prove Lemma 22 in Appendix I.

For convenience, we denote by $\mu_i$ the normalized Copeland score of arm $a_i$ and by $\mu_1$ the maximal normalized Copeland score. To get an informative translation of the above expression to our setting, let $A$ be the set of arms with normalized Copeland score in $(1 - 2(1 - \mu_1), \mu_1]$ and let $\bar{A}$ be the set of the other arms. In our setting, the query complexity of Algorithm 4 is upper bounded by

$$O\Big(\frac{2|A|\log(K/\delta)}{(1 - \mu_1)\epsilon^2} + \sum_{i \in \bar{A}} \frac{\log(K/\delta)(1 - \mu_i)}{(\mu_1 - \mu_i)^2}\Big), \qquad (13)$$

assuming $\delta < (1 - \mu_1)\epsilon$.

It remains to provide an upper bound for the above expression given the structure of the normalized Copeland scores. In particular, we use the result of Lemma 15, repeated here for convenience.

Lemma 15. Let $D \subseteq [K]$ be the set of arms for which $\mathrm{cpld}(a_i) \ge 1 - d/(K - 1)$, that is, arms that are beaten by at most $d$ arms. Then $|D| \le 2d + 1$.
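The counting bound can be checked empirically on random tournaments. The sketch below is an illustration under the assumption that there are no ties (every pair has a strict winner); the helper names and the ranges of $K$ and $d$ are hypothetical choices.

```python
import random

def random_tournament(K, rng):
    """beats[i][j] is True when arm i beats arm j; every pair has exactly one winner."""
    beats = [[False] * K for _ in range(K)]
    for i in range(K):
        for j in range(i + 1, K):
            if rng.random() < 0.5:
                beats[i][j] = True
            else:
                beats[j][i] = True
    return beats

def beaten_by_at_most(beats, d):
    """The set D: arms that lose to at most d other arms."""
    K = len(beats)
    return [i for i in range(K) if sum(beats[j][i] for j in range(K)) <= d]

rng = random.Random(0)
# Smallest observed value of (2d + 1) - |D| over many random tournaments.
worst_slack = min(
    (2 * d + 1) - len(beaten_by_at_most(random_tournament(K, rng), d))
    for K in range(2, 12)
    for d in range(K)
    for _ in range(20)
)
print(worst_slack)
```

The reason the bound holds is a pigeonhole argument: the arms in $D$ play $|D|(|D| - 1)/2$ matches among themselves, so some arm in $D$ loses at least $(|D| - 1)/2$ of them, and that number cannot exceed $d$.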

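We bound the left summand in (13):

$$\frac{2|A|\log(K/\delta)}{(1 - \mu_1)\epsilon^2} \le \frac{\big(4(1 - \mu_1)(K - 1) + 2\big)\log(K/\delta)}{(1 - \mu_1)\epsilon^2} = O\Big(\frac{\log(K/\delta)K}{\epsilon^2}\Big).$$

We now bound the right summand in (13). Let $i \in \bar{A}$. According to the definition of $\bar{A}$ it holds that $(1 - \mu_i) \le 2(\mu_1 - \mu_i)$. Hence,

$$\sum_{i \in \bar{A}} \frac{\log(K/\delta)(1 - \mu_i)}{(\mu_1 - \mu_i)^2} \le \sum_{i \in \bar{A}} \frac{4\log(K/\delta)}{1 - \mu_i}. \qquad (14)$$

Lemma 23. We have $\sum_{i:\ \mu_i < 1} \frac{1}{1 - \mu_i} = O(K\log(K))$.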


Proof. Let $A_r$ be the set of arms for which $2^r \le 1 - \mu_i < 2^{r+1}$. According to Lemma 15 we have that $|A_r| \le 2^{r+2}(K - 1) + 1$. Other than that, since $1 \ge 1 - \mu_i \ge 1/(K - 1)$ for all $i > C$, we have that $A_r = \emptyset$ for any $r < -\log_2(K - 1) - 1$ and any $r > 0$. It follows that

$$\sum_{i > C} \frac{1}{1 - \mu_i} \le \sum_{\ell=0}^{\lceil\log_2(K-1)\rceil} \frac{|A_{\ell - \log_2(K-1)}|}{2^{\ell - \log_2(K-1)}} \le \sum_{\ell=0}^{\lceil\log_2(K-1)\rceil} \frac{2^{\ell - \log_2(K-1) + 2}(K - 1) + 1}{2^{\ell - \log_2(K-1)}}$$

$$\le \big(\lceil\log_2(K-1)\rceil + 1\big) \cdot 5(K - 1).$$

□

From (13), (14) and Lemma 23 we conclude that the total number of queries is bounded by

$$O\Big(\log(K/\delta)\Big(K\log(K) + \frac{K}{\epsilon^2}\Big)\Big).$$
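The $O(K \log K)$ bound on $\sum_i 1/(1 - \mu_i)$ can be illustrated on two extreme tournaments. The sketch below assumes the explicit bound $(\lceil \log_2(K-1) \rceil + 1) \cdot 5(K-1)$ as read from the proof; the two loss profiles (a transitive tournament and a regular cyclic one) are examples chosen here, not taken from the paper.

```python
import math

def inverse_gap_sum(losses, K):
    """Sum of 1/(1 - mu_i) over arms with mu_i < 1, using 1 - mu_i = losses_i / (K - 1)."""
    return sum((K - 1) / l for l in losses if l > 0)

def lemma23_bound(K):
    """Explicit bound (ceil(log2(K-1)) + 1) * 5 * (K-1)."""
    return (math.ceil(math.log2(K - 1)) + 1) * 5 * (K - 1)

def transitive_losses(K):
    """Arm i loses to exactly i arms: a total order, so arm 0 is a Condorcet winner."""
    return list(range(K))

def cyclic_losses(K):
    """Regular tournament on an odd number of arms: every arm loses (K-1)/2 times."""
    assert K % 2 == 1
    return [(K - 1) // 2] * K

for K in (5, 51, 501):
    print(K,
          inverse_gap_sum(transitive_losses(K), K),
          inverse_gap_sum(cyclic_losses(K), K),
          lemma23_bound(K))
```

The transitive case gives $(K-1)H_{K-1}$, which grows like $K \ln K$ and shows that the logarithmic factor cannot be removed, while the cyclic case gives only $2K$.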


In order to prove Lemma 21 it remains to analyze the case where $\epsilon$ is extremely small. Specifically, when $\epsilon^2(1 - \mu_1)$ takes a value smaller than $1/K$, the algorithm becomes inefficient in the sense that it queries the same pair more than once. This can be avoided by taking the samples of $j$ when querying the score of arm $a_i$ to be uniformly random without replacement. The same arguments hold but are more complex, as now the arm pulls are not i.i.d. Nevertheless, the required concentration bounds still hold. The resulting statement is that the number of queries is

$$\tilde{O}\Big(\log(1/\delta)\Big(K + \frac{K}{\tilde{\epsilon}^2}\Big)\Big) \quad \text{with } \tilde{\epsilon} = \max\Big\{\epsilon,\ 1/\sqrt{K(1 - \mu_1)}\Big\}.$$

Lemma 21 immediately follows.

We are now ready to analyze the stochastic setting.


Footnote: The value of $\delta$ we require is $1/T$. If the assumption does not hold in that case, the regret must be linear and all of the statements hold trivially.

Proof of Lemma 14. By querying arm $a_i$ we choose a random arm $j \ne i$ and in fact query the pair $(a_i, a_j)$ sufficiently many times in order to determine whether $p_{ij} > 0.5$ with probability at least $1 - \delta/K^2$. Standard concentration bounds show that achieving this requires querying the pair $(a_i, a_j)$ at most $O\big(\log(K/(\Delta_{ij}\delta))\,\Delta_{ij}^{-2}\big)$ many times. It follows that a single query to arm $a_i$ in the deterministic case translates into an expected number of

$$O\Big(\frac{1}{K - 1}\sum_{j \ne i} \frac{\log\big(K/(\Delta_{ij}\delta)\big)}{\Delta_{ij}^2}\Big)$$

many queries in the stochastic setting. The claim now follows from the bound on the expected number of queries given in Lemma 21. □


I KL-based approximate best arm identification algorithm

Algorithm 4 solves an approximate best arm identification problem using confidence bounds based on Chernoff's inequality stated w.r.t. the KL-divergence of two random variables. Recall that for two Bernoulli random variables with parameters $p, q$ the KL-divergence from $q$ to $p$ is defined as $d(p, q) = (1 - p)\ln((1 - p)/(1 - q)) + p\ln(p/q)$, with $0\ln(0) = 0$. The building block of Algorithm 4 is the well-known Chernoff bound stating that for a Bernoulli random variable with expected value $q$, the probability of the average of $n$ i.i.d. samples from it being smaller (larger) than $p$, for $p < q$ ($p > q$), is bounded by $\exp(-n\,d(p, q))$.
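As a small illustration of the two ingredients just recalled, the sketch below implements the Bernoulli KL-divergence with the convention $0\ln(0) = 0$ and compares the Chernoff bound with an exact binomial tail. The helper names and the numbers `n`, `q`, `k` are hypothetical and chosen only for the example.

```python
import math

def bernoulli_kl(p, q):
    """d(p, q) = (1-p) ln((1-p)/(1-q)) + p ln(p/q), with the convention 0*ln(0) = 0."""
    def term(a, b):
        if a == 0.0:
            return 0.0
        return math.inf if b == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def chernoff_bound(n, p, q):
    """Bound exp(-n d(p, q)) on the probability that the mean of n samples crosses p."""
    return math.exp(-n * bernoulli_kl(p, q))

def lower_tail(n, q, k):
    """Exact P(Binom(n, q) <= k)."""
    return sum(math.comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(k + 1))

n, q, k = 50, 0.6, 20          # sample mean threshold p = k/n = 0.4 < q
exact = lower_tail(n, q, k)
bound = chernoff_bound(n, k / n, q)
print(exact, bound)
```

Unlike the Hoeffding bound, this bound adapts to the variance of the arm: for $q$ close to 0 or 1 the divergence $d(p, q)$ is much larger than $2(p - q)^2$, which is the source of the $1 - \mu_i$ savings discussed in Section H.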


Algorithm 4 KL-best arm identification

Input: Access to oracle giving a noisy approximation of the reward of arm $i$ for $K$ arms, success probability $\delta > 0$, approximation parameter $\epsilon > 0$

for all $i \in [K]$ do
    $t \leftarrow 1$
    $S_i \leftarrow \mathrm{reward}(i)$
    $I_i \leftarrow [0, 1]$
end for
$B \leftarrow [K]$
$t \leftarrow 2$
while $\dfrac{1 - \max_{i \in B} \min I_i}{1 - \max_{i \in B} \max I_i} > (1 + \epsilon)$ do
    For all $i \in B$, $S_i \leftarrow S_i + \mathrm{reward}(i)$
    For all $i \in B$, let $I_i = \{q \in [0, 1] : t \cdot d(S_i/t, q) \le \ln(4tK/\delta) + 2\ln\ln(t)\}$
    For all $i \in B$ for which there exists some $j \in B$ with $\max\{q \in I_i\} < \min\{q \in I_j\}$, remove $i$ from $B$.
    $t \leftarrow t + 1$
end while

Return: $\arg\max_{i \in B} \min I_i$.
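The pseudocode above can be turned into a short runnable sketch. This is one possible reading rather than the authors' implementation: the confidence intervals are computed by bisection, the stopping rule is rearranged to avoid dividing by zero, a `max_rounds` cap is added as a safeguard, and the arm means used in the example are hypothetical.

```python
import math
import random

def kl(p, q):
    """Bernoulli KL divergence d(p, q), with the convention 0*ln(0) = 0."""
    def term(a, b):
        if a == 0.0:
            return 0.0
        return math.inf if b == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def kl_interval(mean, t, budget, iters=40):
    """Endpoints of {q in [0, 1] : t * d(mean, q) <= budget}, found by bisection."""
    def inside(q):
        return t * kl(mean, q) <= budget
    lo_out, lo_in = 0.0, mean
    hi_in, hi_out = mean, 1.0
    for _ in range(iters):
        mid = (lo_out + lo_in) / 2
        if inside(mid):
            lo_in = mid
        else:
            lo_out = mid
        mid = (hi_in + hi_out) / 2
        if inside(mid):
            hi_in = mid
        else:
            hi_out = mid
    return lo_in, hi_in

def kl_best_arm(reward, K, delta, eps, max_rounds=100_000):
    S = [reward(i) for i in range(K)]       # one initial sample per arm
    I = [(0.0, 1.0) for _ in range(K)]      # confidence intervals (min I_i, max I_i)
    B = set(range(K))                       # arms not yet eliminated
    t = 2
    while t <= max_rounds:
        best_lo = max(I[i][0] for i in B)
        best_hi = max(I[i][1] for i in B)
        # Stop once (1 - max min I) <= (1 + eps) * (1 - max max I).
        if 1.0 - best_lo <= (1.0 + eps) * (1.0 - best_hi):
            break
        budget = math.log(4 * t * K / delta) + 2 * math.log(math.log(t))
        for i in B:
            S[i] += reward(i)
            I[i] = kl_interval(S[i] / t, t, budget)
        top_lo = max(I[i][0] for i in B)
        # Drop every arm whose upper end lies below some other arm's lower end.
        B = {i for i in B if I[i][1] >= top_lo}
        t += 1
    return max(B, key=lambda i: I[i][0])

means = [0.9, 0.5, 0.3]                     # hypothetical Bernoulli arms
rng = random.Random(0)
best = kl_best_arm(lambda i: 1.0 if rng.random() < means[i] else 0.0,
                   K=3, delta=0.1, eps=1.0)
print(best)
```

Because every surviving arm is sampled once per round, $S_i/t$ is the empirical mean of arm $i$ after $t$ samples, which is what the confidence set in the pseudocode uses.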


Proof of Lemma 22. We use an immediate application of the Chernoff-Hoeffding bound.

Lemma 24. Fix $i \in [K]$. Let $E^t_i$ denote the event that at iteration $t$, $\mu_i \notin I_i$. We have that

$$\Pr[E^t_i] \le 2 \cdot \frac{\delta}{4tK\log(t)^2} \le \frac{\delta}{2t\log(t)^2 K}.$$

Let $E$ denote the union, over all $t, i$, of the events $E^t_i$. That is, $E$ denotes the event in which there exist some iteration $t$ and some arm $a_i$ such that $\mu_i \notin I_i$. By the above lemma we get that

$$\Pr[E] \le \sum_{i \in [K]} \sum_{t \ge 2} \frac{\delta}{2t\log(t)^2 K} \le \delta.$$

It follows that, given that event $E$ did not happen, the algorithm will never eliminate the top arm and, furthermore, will output a $(1 + \epsilon)$-approximate best arm. We proceed to analyze the total number of pulls per arm, with a separate analysis for $(1 + \epsilon)$-approximate best arms and for the other arms. We begin by stating an auxiliary lemma giving explicit bounds for the confidence regions.

Lemma 25. Assume that event $E$ did not occur and let $\rho > 0$. For a sufficiently large universal constant $c$ we have, for any $t \ge c\log(tK/\delta)(1 - \mu_i)/\rho^2$, that $\max I_i < \mu_i + \rho$. Also, for $t \ge c\log(tK/\delta)(1 - \mu_i + \rho)/\rho^2$ it holds that $\min I_i > \mu_i - \rho$.

Proof. We consider the Taylor series associated with $f(x) = d(p + x, p)$. Since $f(0) = f'(0) = 0$, it holds that for any $x \le 1 - p$ there exists some $|x'| \le |x|$ with

$$f(x) = \frac{x^2}{2} f''(x') = \frac{x^2}{2(p + x')(1 - p - x')} \ge \frac{x^2}{2(1 - p - x')}.$$

To prove that $\max I_i < \mu_i + \rho$ we apply the above observation for $\rho < 1 - \mu_i$ (otherwise $\mu_i + \rho \ge 1$ and the claim is trivial) and reach the conclusion that for a sufficiently large universal constant $c$ it holds that

$$t \cdot d(\mu_i + \rho/2, \mu_i) > \log(tK/\delta) + 2\log\log(tK/\delta)$$

$$t \cdot d(\mu_i + \rho/2, \mu_i + \rho) > \log(tK/\delta) + 2\log\log(tK/\delta)$$

The first inequality dictates that $S_i/t < \mu_i + \rho/2$. The second inequality dictates that $t \cdot d(S_i/t, \mu_i + \rho) > t \cdot d(\mu_i + \rho/2, \mu_i + \rho)$ is too large for $\mu_i + \rho$ to be an element of $I_i$.

The bound for $\min I_i$ is analogous. Since now we have $t \ge c\log(tK/\delta)(1 - \mu_i + \rho)/\rho^2$, it holds that

$$t \cdot d(\mu_i - \rho/2, \mu_i) > \log(tK/\delta) + 2\log\log(tK/\delta)$$

$$t \cdot d(\mu_i - \rho/2, \mu_i - \rho) > \log(tK/\delta) + 2\log\log(tK/\delta)$$

This means that first, $S_i/t > \mu_i - \rho/2$, and second, that $t \cdot d(S_i/t, \mu_i - \rho) > t \cdot d(\mu_i - \rho/2, \mu_i - \rho)$ is too large for $\mu_i - \rho$ to be an element of $I_i$. □

Lemma 26. Let $i$ be a suboptimal arm, meaning one where $\mu_i < 1 - (1 - \mu_1)(1 + \epsilon)$. Denote by $\Delta_i$ its gap $\mu_1 - \mu_i$. If event $E$ does not occur, then $i$ is queried at most $O\Big(\frac{v_i\log\big(K/(\delta\Delta_i)\big)}{\Delta_i^2}\Big)$ many times, where $v_i = 1 - \mu_i$.

Proof. We first notice that, as we are assuming that event $E$ did not happen, it must be the case that arm 1 is never eliminated from $B$. Consider an iteration $t$ such that

$$t \ge \frac{c\log(tK/\delta)\,v_i}{(\Delta_i)^2} \qquad (15)$$

for a sufficiently large $c$; then according to Lemma 25 it holds that $\max I_i < \mu_i + \Delta_i/2$. Now, since $v_i = 1 - \mu_i \ge 1 - \mu_1 + \Delta_i/2$, we have that for the same $t$ it must be the case that $\min I_1 > \mu_1 - \Delta_i/2$. It follows that $\min I_1 > \max I_i$ and arm $i$ is eliminated at round $t$. □


Lemma 27. Assume $\epsilon \le 1$. If event $E$ does not occur, then for some sufficiently large universal constant $c$ it holds that when $t \ge \frac{c\log(tK/\delta)}{(1 - \mu_1)\epsilon^2}$ the algorithm terminates.

Proof. Let $i$ be an arbitrary arm. Since

$$\frac{c\log(tK/\delta)}{(1 - \mu_1)\epsilon^2} = \frac{c\log(tK/\delta)(1 - \mu_i)}{(1 - \mu_1)(1 - \mu_i)\epsilon^2},$$

we get, according to Lemma 25, that

$$\max I_i < \mu_i + \frac{\epsilon}{5}\sqrt{(1 - \mu_1)(1 - \mu_i)}.$$

In order to bound $\sqrt{(1 - \mu_1)(1 - \mu_i)}$ we consider the function $f(x) = \sqrt{v(v + x)}$. Notice that $f(0) = v$ and $f'(x) \le \frac{1}{2}$ for $x \ge 0$. It follows that for positive $x$, $\sqrt{v(v + x)} \le v + x/2$, meaning that, with $v = 1 - \mu_1$ and $x = \Delta_i$,

$$\max I_i < \mu_i + \frac{\epsilon\big((1 - \mu_1) + \Delta_i/2\big)}{5} \le \mu_1 + \frac{\epsilon(1 - \mu_1)}{5}.$$

Now, since $\epsilon \le 1$ we have

$$\frac{c\log(tK/\delta)(1 - \mu_1)}{(1 - \mu_1)^2\epsilon^2} \ge \frac{(c/2)\log(tK/\delta)\big(1 - \mu_1 + \epsilon(1 - \mu_1)\big)}{(1 - \mu_1)^2\epsilon^2},$$

hence for sufficiently large $c$ we can apply Lemma 25 and obtain

$$\min I_1 > \mu_1 - \frac{\epsilon(1 - \mu_1)}{5}.$$

It follows that, assuming $\epsilon \le 1$,

$$\min I_1 > 1 - \Big(1 - \max_i \max I_i\Big)(1 + \epsilon),$$

meaning that the algorithm will terminate at iteration $t$. □

This concludes the proof of Lemma 22. □

Table 2: List of notation used in this paper

Symbol    Definition

$K$    Number of arms
$[K]$    The set $\{1, \ldots, K\}$
$a_1, \ldots, a_K$    Set of arms
$p_{ij}$    Probability of arm $a_i$ beating arm $a_j$
$\mathrm{Cpld}(a_i)$    Copeland score: number of arms that $a_i$ beats, i.e. $|\{j \mid p_{ij} > 0.5\}|$
$\mathrm{cpld}(a_i)$    Normalized Copeland score: $\mathrm{Cpld}(a_i)/(K - 1)$
$C$    Number of Copeland winners, i.e. arms $a_i$ with $\mathrm{Cpld}(a_i) \ge \mathrm{Cpld}(a_j)$ for all $j$
$a_1, \ldots, a_C$    Copeland winner arms
$\alpha$    UCB parameter of Algorithm 1
$\delta$    Probability of failure
$C(\delta)$    $\left(\frac{(4\alpha - 1)K^2}{(2\alpha - 1)\delta}\right)^{\frac{1}{2\alpha - 1}}$
$N_i(t)$    Number of times arm $a_i$ was chosen as the optimistic Copeland winner until time $t$
$N^\delta_i(t)$    Number of times arm $a_i$ was chosen as the optimistic Copeland winner in the interval $[C(\delta), t]$
$N_{ij}(t)$    Total number of time-steps before $t$ when $a_i$ was compared against $a_j$ (notice that this definition is symmetric with respect to $i$ and $j$)
$N^\delta_{ij}(t)$    Number of time-steps between times $C(\delta)$ and $t$ when $a_i$ was chosen as the optimistic Copeland winner and $a_j$ as the challenger (note that, unlike $N_{ij}(t)$, this definition is not symmetric with respect to $i$ and $j$)
$\tau_{ij}$    The last time-step when $a_i$ was chosen as the optimistic Copeland winner and $a_j$ as the challenger (note that $\tau_{ij} \ge C(\delta)$ iff $N^\delta_{ij}(t) > 0$)
$w_{ij}(t)$    Number of wins of $a_i$ over $a_j$ until time $t$
$u_{ij}(t)$    $\frac{w_{ij}(t)}{N_{ij}(t)} + \sqrt{\frac{\alpha\ln t}{N_{ij}(t)}}$
$l_{ij}(t)$    $1 - u_{ji}(t)$
$\overline{\mathrm{Cpld}}(a_i)$    $\#\{k \mid u_{ik} \ge \frac{1}{2},\ k \ne i\}$
$\underline{\mathrm{Cpld}}(a_i)$    $\#\{k \mid l_{ik} \ge \frac{1}{2},\ k \ne i\}$
$C_t$    $\{i \mid \overline{\mathrm{Cpld}}(a_i) = \max_j \overline{\mathrm{Cpld}}(a_j)\}$
$\mathcal{L}_i$    The set of arms to which $a_i$ loses, i.e. $a_j$ such that $p_{ij} < 0.5$
$L_C$    The largest number of losses that any Copeland winner has, i.e. $\max_{i=1}^{C} |\{j \mid p_{ij} < 0.5\}|$
$\overline{L}_C$    Algorithm 1's estimate of $L_C$
$B_t$    The potentially best arms at time $t$, i.e. the set of arms that according to Algorithm 1 have some chance of being Copeland winners
$B^i_t$    The arms that at time $t$ have the best chance of beating arm $a_i$ (cf. Line 12 in Algorithm 1)
$\Delta_{ij}$    $|p_{ij} - 0.5|$
$\Delta_{\min}$    $\min\{\Delta_{ij} \mid \Delta_{ij} \ne 0\}$
$i^*$    The index of the $(L_C + 1)^{\text{th}}$ largest element in the set $\{\Delta_{ij} \mid p_{ij} < 0.5\}$ in the case that $i > C$
$\Delta^*_i$    $\Delta_{ii^*}$ if $i > C$; $0$ otherwise


Table 3: List of notation used in this paper (Cont'd)

Symbol    Definition

$\Delta^*_{ij}$    $\Delta^*_i + \Delta_{ij}$ if $p_{ij} \ge 0.5$; $\max\{\Delta^*_i, \Delta_{ij}\}$ otherwise (see the accompanying figures for a pictorial explanation)

A^^in 

. , * u u 

mm Ai 
i>C 

N!j{T) 

0 if i = y and i > C 

n!{t) 


n\t) 

^iV^.(T) + l 

Ts > 

C{T) + 8K^{Lc + 1)^ In ^ In 6|f 

_^32aXCLc±i) YnTs + N^/^{Ts) 

+AKmixKi>cNt'^{Ts) 

Ts is the smallest integer satisfying the above inequality (Cf. Definitionj^. 

To 

C{5/2) + N^/^{Ts) 

, 32aK(Lc + l)lnTs 

A^in 

+87i:'*(Lc + l)^ln^ 

rib 

2KN^^‘^{fs) + 

Binom{n,p) 

A “binomial” random variable obtained from the sum of n independent Bernoulli random variables, each of 
which produces 1 with probability p and 0 otherwise. 

A. 

max|cpld(ai) — cpld(ai), ^flri j" 




maxi Hi 

$\Delta^\epsilon_i$    $\max\{\Delta_i,\ \epsilon(1 - \mathrm{cpld}(a_1))\}$